Introduction

Building on the exploratory data analysis from the first project, this project aims to develop regression and machine learning models. Several conclusions from the first project shaped its scope:

Although the analysis showed little new correlation between variables and SVI themes, this is itself an informative finding.

1: The means of SVI scores across themes are similar and close in range.

2: Each theme addresses important vulnerabilities.

3: Little new correlation was found.

Thus, this project brings in two additional datasets to examine new correlations and identify whether they could be effective predictors of SVI scores.

During the project’s initial phase, limited novel correlations between demographic variables and SVI themes were identified at the census tract level. No demographic variables were left out of the SVI themed groupings, indicating that the CDC-compiled SVI is a holistic, well-constructed representation of vulnerability. In the second phase, the goal is to use the themed and overall SVI scores, in combination with additional datasets, to see whether predictions can be made about census tract or county vulnerability scores. Specifically, correlations between COVID-19 data, median income, and the SVI dataset will be examined. The objective extends to developing predictive models designed to pinpoint vulnerable communities in the event of external stressors (such as a disease outbreak). The temporal scale of the dataset covers SVI scores before and after the COVID-19 pandemic.

Research Topic

SMART Questions:

1: Can we explore the presence of correlations between COVID-19 mortality and elevated Social Vulnerability Index (SVI) scores across SVI themes?

2: Is it possible to assess the influence of COVID-19 on SVI scores (before and after the pandemic)?

3: Does median income impact the SVI scores in a significant way?

4: Can we develop predictive models to identify vulnerable communities following external stressors?

5: Which features are important/significant in predicting SVI?

The economic dataset adds another census-reported variable as a metric of economic status. Combined with other variables selected through feature selection, it will be used to build a logistic regression model, which will in turn inform a classification tree that predicts whether a tract is Low risk, Medium-Low risk, Medium-High risk, or High risk. Breaking the outcome into four levels can help officials, planners, and first responders focus attention on the tracts classified as highest risk, while the tree's nodes offer a quick view of the demographic breakdown.

The primary objective of utilizing the COVID dataset is to incorporate supplementary variables associated with COVID-19, encompassing metrics such as cases and deaths, into the Social Vulnerability Index (SVI) dataset. These additional variables are crucial indicators of the pandemic’s extent and impact. Through meticulous feature selection on the merged dataset, significant factors will be identified. Subsequently, these selected features will be employed to construct a robust logistic regression model capable of predicting the SVI score. This analysis holds the potential to assist policymakers, healthcare professionals, and emergency responders in efficiently allocating resources by precisely identifying areas at the highest risk of COVID-19 transmission and its associated challenges.

Data Set and Variables

SVI Data

The SVI comprises 5 calculations: 4 thematic scores and 1 overall summary computed from the sum of the themes.

It is constructed by selecting specific indicator variables within different themes chosen to represent the various aspects of vulnerability, enabling this project to examine whether any theme leaves out variables that could be important. Census tracts are then ranked within each state, as well as against other states, producing tract rankings from 0 to 1, with higher values indicating greater vulnerability. The CDC states: “For each tract, we generated its percentile rank among all tracts for 1) the 16 individual variables, 2) the four themes, and 3) its overall position.”

Then, these percentiles were summed for each of the four themes, and then ordered to determine theme-specific percentile rankings.
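The percentile-rank construction described above can be sketched in base R. This is a minimal illustration on a synthetic indicator variable, using the percentile-rank formula (rank − 1) / (n − 1) given in the CDC's SVI documentation:

```r
# Minimal sketch of percentile ranking on a synthetic indicator variable.
# CDC SVI documentation gives percentile rank = (rank - 1) / (n - 1).
values <- c(12, 45, 3, 67, 89, 23, 56, 34, 78, 10)

pct_rank <- (rank(values) - 1) / (length(values) - 1)

# Ranks run from 0 (least vulnerable) to 1 (most vulnerable)
pct_rank[which.max(values)]  # 1
pct_rank[which.min(values)]  # 0
```

Summing these per-variable percentile ranks within a theme, then percentile-ranking the sums, yields the theme scores described above.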

Spatial Data

The geographic scale of the data is limited to California census tracts, allowing a detailed analysis of over 9,000 tracts and, hopefully, more tailored actions and responses. California is prone to natural disasters such as earthquakes and wildfires and has a very large population, making it an important case study.

Economic Data

The economic data used in this project is Median Income of Households in 2019 acquired from the US Census Bureau for the California Census Tract level.

COVID Data

The COVID-19 data used in this project is acquired from the Roche Data Science Coalition (RDSC) for all 50 states.

Cleaning the Data

Information on cleaning the SVI dataset can be found in the Project 1 write-up.

econ <- read.csv("ACSDT5Y2020.B19013-Data.csv")
econ <- subset(econ, select = c(GEO_ID, NAME,B19013_001E))
econ <- econ[-c(1, 2), ] 
#rename columns
names_to_change <- c("GEO_ID", "NAME", "B19013_001E")
new_names <- c("GEO_ID", "tract", "income")
econ <- setNames(econ, new_names)
# edit GEO_ID to isolate just the number after 1400000US06001400100
econ$GEO_ID <- sub(".*US0*(\\d+)", "\\1", econ$GEO_ID)
SVI_Data <- read.csv("SVI_2020_US.csv")
us_cases <- read.csv('USAFacts/confirmed-covid-19-cases-in-us-by-state-and-county.csv')
us_deaths <- read.csv('USAFacts/confirmed-covid-19-deaths-in-us-by-state-and-county.csv')

Data Analysis

Economic

Economic Data EDA and Modeling

EDA

Data was cleaned and formatted prior to analysis.

#Import SVI
SVI_Data <- read.csv("SVI_2020_US.csv")
Clean_data <- subset(SVI_Data, select = c(ST,STATE,ST_ABBR,STCNTY,COUNTY,FIPS,LOCATION,AREA_SQMI,EPL_POV150,    EPL_UNEMP,  EPL_HBURD,  EPL_NOHSDP, EPL_UNINSUR,    SPL_THEME1, RPL_THEME1, EPL_AGE65,  EPL_AGE17,  EPL_DISABL, EPL_SNGPNT, EPL_LIMENG, SPL_THEME2, RPL_THEME2, EPL_MINRTY, SPL_THEME3, RPL_THEME3, E_MINRTY, EP_HISP, EP_ASIAN, EP_AIAN, EPL_MUNIT,    EPL_MOBILE, EPL_CROWD,  EPL_NOVEH,  EPL_GROUPQ, SPL_THEME4, RPL_THEME4, SPL_THEMES, RPL_THEMES, E_AGE65, EP_POV150, EP_AGE65, EP_NOHSDP
) )

CA_SVI <- subset(Clean_data, ST_ABBR == "CA")
CA_SVI <- subset(CA_SVI,  RPL_THEMES!= -999 )
CA_SVI <- subset(CA_SVI,  RPL_THEME1!= -999 )
CA_SVI <- subset(CA_SVI,  RPL_THEME2!= -999 )
CA_SVI <- subset(CA_SVI,  RPL_THEME3!= -999 )
CA_SVI <- subset(CA_SVI,  RPL_THEME4!= -999 )

#Join SVI to econ based on GEO_ID
svi_econ <- merge(econ, CA_SVI, by.x = "GEO_ID", by.y = "FIPS", all.x = TRUE, all.y = TRUE)

# Count missing values
total_na_count <- sum(is.na(svi_econ))
#print(total_na_count)
#remove NA
svi_econ <- na.omit(svi_econ)
svi_econ <- svi_econ[svi_econ$income != "-", , drop = FALSE]
svi_econ$income <- as.numeric(as.character(svi_econ$income))
svi_econ <- na.omit(svi_econ)

Histogram of Median Income

ggplot(svi_econ, aes(x = income)) +
  geom_histogram(binwidth = 1000, fill = "seagreen", color = "palegreen3", alpha = 0.7) +
  labs(title = "Histogram of Median Income of Census Tracts in 2019", x = "Median Income", y = "Frequency")

From this plot, we can see that the income data is skewed right: lower median incomes are more common. For the purposes of classifying data, this distribution does not need to be normal, as long as there are enough observations to represent each target class.
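A quick sanity check for right skew, sketched here on synthetic lognormal data rather than the actual income column: in a right-skewed distribution, the long right tail pulls the mean above the median.

```r
# Synthetic income-like data: lognormal draws are right-skewed by construction
set.seed(42)
incomes <- rlnorm(1000, meanlog = 11, sdlog = 0.5)

mean(incomes) > median(incomes)  # TRUE: the long right tail pulls the mean up
```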

Looking back at our RPL_THEMES, we can observe its distribution as well.

ggplot(svi_econ, aes(x = RPL_THEMES)) +
  geom_histogram(binwidth = 0.01, fill = "tomato1", color = "tomato4", alpha = 0.7) +
  labs(title = "Histogram of SVI Score of Census Tracts in 2020", x = "SVI Score", y = "Frequency")

It is seen that the distribution is skewed left.

Map

In the original project, variables were mapped by county. Here, the income distribution can be mapped to get a visual understanding of its spatial pattern across census tracts.

# econmap: income data joined to tract geometries (sf object; preparation not shown)
map1 <- ggplot(data = econmap) +
  geom_sf(aes(fill = income)) +
  labs(title = "Median Income by Census Tract: 2019",
       fill = "Income in Dollars") +
  scale_fill_viridis_c() +
  theme_void() +
  theme(text = element_text(family = "Avenir"))

map1

It is seen that the majority of census tracts fall in a lower range, with a clustering of higher incomes in the coastal region in the middle of the state.

SVI & Income

Income’s effect on the SVI score can be examined and interpreted using linear regression.

library(broom)
library(knitr)

# Express income in units of $100,000 so coefficients are per $100k of median income
svi_econ$income_scaled <- svi_econ$income / 100000

model_econ <- lm(RPL_THEMES ~ income_scaled, data = svi_econ)

# Tidy up the results using broom
tidy_results <- tidy(model_econ)

#summary(model_econ)
# Print the formatted table
kable(tidy_results, format = "markdown")
term           estimate    std.error  statistic  p.value
(Intercept)     1.0740184  0.0043133   249.0021        0
income_scaled  -0.5687045  0.0046053  -123.4893        0

income_scaled (-0.568704): The estimated change in RPL_THEMES for a one-unit increase in income_scaled (a one-unit increase corresponds to an additional $100,000 of median income). The negative coefficient suggests a negative relationship between income_scaled and RPL_THEMES.

F-statistic (15250): A test of the overall significance of the model. It compares the fit of the intercept-only model with the fit of the given model. A higher F-statistic and a lower p-value (< 0.05) suggest that at least one variable is significantly related to the dependent variable.

P-value (< 2.2e-16): The p-value associated with the F-statistic is very close to zero, indicating that the overall model is statistically significant.

Conclusion: The results indicate that income (measured by the median income of census tracts) is likely a good predictor of SVI risk scores, as it is statistically significant. This answers the SMART question of whether median income is statistically significant and has a meaningful impact on SVI scores.
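To illustrate how such a fitted model would be used for prediction, here is a hedged sketch on simulated data standing in for svi_econ (the slope of -0.57 mirrors the estimate above, but the tracts here are synthetic):

```r
set.seed(1)
# Simulated tracts: income_scaled in units of $100,000, SVI falling with income
income_scaled <- runif(200, 0.2, 2.0)
RPL_THEMES <- pmin(pmax(1.07 - 0.57 * income_scaled + rnorm(200, sd = 0.1), 0), 1)
fake_tracts <- data.frame(income_scaled, RPL_THEMES)

m <- lm(RPL_THEMES ~ income_scaled, data = fake_tracts)

# Predicted SVI for tracts with median incomes of $50k and $150k:
# the lower-income tract should receive the higher predicted vulnerability
preds <- predict(m, newdata = data.frame(income_scaled = c(0.5, 1.5)))
preds
```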

Feature Selection

To answer the SMART question regarding which features are significant in predicting and modeling SVI, feature selection algorithms were used. Numerous methodologies were used to select relevant and significant features for the linear regression and classification model. This part of the project will walk through this selection process.

Correlation Matrix

Highly correlated variables:

# Convert selected columns to numeric
svi_select <- mutate_all(svi_econ, as.numeric)

# Drop columns with NA values
svi_select <- svi_select %>%
  select(everything(), -where(~any(is.na(.))))

# Rename multiple columns simultaneously

colnames(svi_select)[colnames(svi_select) == 'RPL_THEMES'] <- 'SVI'
colnames(svi_select)[colnames(svi_select) == 'EPL_POV150'] <- 'Poverty'
colnames(svi_select)[colnames(svi_select) == 'EPL_HBURD'] <- 'Housing_Cost_Burdened'
colnames(svi_select)[colnames(svi_select) == 'EPL_NOHSDP'] <- 'No_Diploma'

# Create the correlation matrix
cor_matrix <- cor(svi_select, use = "complete.obs")

# Variable of interest
target_variable <- "SVI"
# Extract the correlations with the target variable
cor_with_target <- cor_matrix[target_variable, ]
# Select variables with high correlation 
high_correlation_vars <- names(cor_with_target[abs(cor_with_target) > 0.65])

# Print the variables with high correlation
print(high_correlation_vars)
##  [1] "income"                NA                      "Poverty"              
##  [4] "Housing_Cost_Burdened" "No_Diploma"            "EPL_UNINSUR"          
##  [7] "SPL_THEME1"            "RPL_THEME1"            "SPL_THEME2"           
## [10] "RPL_THEME2"            "EP_HISP"               "EPL_CROWD"            
## [13] "SPL_THEME4"            "RPL_THEME4"            "SPL_THEMES"           
## [16] "SVI"                   "EP_POV150"             "EP_NOHSDP"            
## [19] "income_scaled"
# Select the relevant columns for correlation
correlation_matrix1 <- svi_select %>%
  select(SVI, Poverty, Housing_Cost_Burdened, No_Diploma, income_scaled)

# Calculate the correlation matrix
correlation_matrix <- cor(correlation_matrix1)

loadPkg("corrplot")

corrplot(correlation_matrix, method = "square", type = "lower", col = colorRampPalette((c("#2166AC","#FDDBC7","#B2182B")))(100))

This assessment utilized a correlation matrix to identify which variables have a strong relationship with the target variable SVI. The threshold for what counted as a significant relationship was set to 0.65.

From assessing the correlations, only 4 variables (income_scaled, EPL_POV150, EPL_HBURD, EPL_NOHSDP) had a correlation of 0.65 or higher with the RPL_THEMES variable. This suggests these variables are likely predictors of the outcome SVI score, but it is important to assess multicollinearity.

Residual plots:

The residual plots appear to have a normal distribution. It is also important to assess the variables with a VIF assessment to address multicollinearity.

predict_svi <- svi_econ %>%
  select(RPL_THEMES, income_scaled, EPL_POV150, EPL_HBURD, EPL_NOHSDP)

colnames(predict_svi)[colnames(predict_svi) == 'RPL_THEMES'] <- 'SVI'
colnames(predict_svi)[colnames(predict_svi) == 'EPL_POV150'] <- 'Poverty'
colnames(predict_svi)[colnames(predict_svi) == 'EPL_HBURD'] <- 'Housing_Cost_Burdened'
colnames(predict_svi)[colnames(predict_svi) == 'EPL_NOHSDP'] <- 'No_Diploma'



model <- lm(SVI ~ ., data = predict_svi)

# Check the distribution of residuals
residuals <- residuals(model)

# Residual Plot
par(mfrow = c(2, 2))
#plot(model)

# Q-Q Plot
qqnorm(residuals)
qqline(residuals)


# Kernel Density Plot
plot(density(residuals))

VIF Assessment

model_test <- glm(SVI ~ ., data = predict_svi)
summary(model_test)
## 
## Call:
## glm(formula = SVI ~ ., data = predict_svi)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.58346  -0.06916   0.00315   0.07167   0.42826  
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            0.154554   0.010208   15.14   <2e-16 ***
## income_scaled         -0.070985   0.006151  -11.54   <2e-16 ***
## Poverty                0.260760   0.007749   33.65   <2e-16 ***
## Housing_Cost_Burdened  0.221818   0.007789   28.48   <2e-16 ***
## No_Diploma             0.407627   0.005192   78.51   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.01240325)
## 
##     Null deviance: 723.69  on 8955  degrees of freedom
## Residual deviance: 111.02  on 8951  degrees of freedom
## AIC: -13892
## 
## Number of Fisher Scoring iterations: 2
data.frame(vif(model_test))
##                       vif.model_test.
## income_scaled                4.300141
## Poverty                      3.508515
## Housing_Cost_Burdened        3.647747
## No_Diploma                   1.981004

A high VIF value can indicate multicollinearity. VIF values above 5 are considered moderately high, and values above 10 are generally recommended against. In this analysis, all VIF values are below 5, so all 4 variables were retained.
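The VIF itself is just 1 / (1 − R²), where R² comes from regressing one predictor on the others. A small self-contained illustration on synthetic correlated data:

```r
set.seed(7)
x1 <- rnorm(500)
x2 <- 0.8 * x1 + rnorm(500, sd = 0.6)  # deliberately correlated with x1
x3 <- rnorm(500)                        # independent noise

# VIF for x2: regress it on the remaining predictors
r2_x2  <- summary(lm(x2 ~ x1 + x3))$r.squared
vif_x2 <- 1 / (1 - r2_x2)
vif_x2  # noticeably above 1 because x2 shares variance with x1
```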

Stepwise

The stepwise feature selection was run on the dataset in both directions.

Initial Model

# Create the initial model
initial_model <- lm(RPL_THEMES ~ income_scaled + EPL_POV150 +EPL_UNINSUR +EPL_AGE65 +EPL_AGE17+ EPL_DISABL+ EPL_SNGPNT +EPL_LIMENG + EPL_MINRTY +EPL_UNEMP + EPL_HBURD + EPL_NOHSDP, data = svi_econ)

# Perform stepwise selection (both directions)
stepwise_model <- step(initial_model, direction = "both")
## Start:  AIC=-43749.07
## RPL_THEMES ~ income_scaled + EPL_POV150 + EPL_UNINSUR + EPL_AGE65 + 
##     EPL_AGE17 + EPL_DISABL + EPL_SNGPNT + EPL_LIMENG + EPL_MINRTY + 
##     EPL_UNEMP + EPL_HBURD + EPL_NOHSDP
## 
##                 Df Sum of Sq    RSS    AIC
## <none>                       67.511 -43749
## - income_scaled  1    0.2624 67.773 -43716
## - EPL_AGE17      1    0.2826 67.793 -43714
## - EPL_MINRTY     1    1.7925 69.303 -43516
## - EPL_AGE65      1    3.3452 70.856 -43318
## - EPL_UNINSUR    1    4.7516 72.262 -43142
## - EPL_UNEMP      1    4.8179 72.329 -43134
## - EPL_LIMENG     1    6.5839 74.095 -42918
## - EPL_HBURD      1    7.2463 74.757 -42838
## - EPL_POV150     1    7.2804 74.791 -42834
## - EPL_DISABL     1    7.3091 74.820 -42830
## - EPL_NOHSDP     1    7.6045 75.115 -42795
## - EPL_SNGPNT     1    8.5891 76.100 -42679

Stepwise Results:

# Display the summary of the selected model
summary(stepwise_model)
## 
## Call:
## lm(formula = RPL_THEMES ~ income_scaled + EPL_POV150 + EPL_UNINSUR + 
##     EPL_AGE65 + EPL_AGE17 + EPL_DISABL + EPL_SNGPNT + EPL_LIMENG + 
##     EPL_MINRTY + EPL_UNEMP + EPL_HBURD + EPL_NOHSDP, data = svi_econ)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.42698 -0.05646 -0.00013  0.05753  0.29479 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.230164   0.010370 -22.195  < 2e-16 ***
## income_scaled -0.030575   0.005186  -5.896 3.87e-09 ***
## EPL_POV150     0.194878   0.006275  31.055  < 2e-16 ***
## EPL_UNINSUR    0.123601   0.004927  25.088  < 2e-16 ***
## EPL_AGE65      0.100585   0.004778  21.051  < 2e-16 ***
## EPL_AGE17      0.025390   0.004150   6.119 9.83e-10 ***
## EPL_DISABL     0.150926   0.004850  31.116  < 2e-16 ***
## EPL_SNGPNT     0.140333   0.004160  33.731  < 2e-16 ***
## EPL_LIMENG     0.188629   0.006387  29.532  < 2e-16 ***
## EPL_MINRTY     0.131280   0.008519  15.409  < 2e-16 ***
## EPL_UNEMP      0.096957   0.003838  25.263  < 2e-16 ***
## EPL_HBURD      0.195848   0.006321  30.982  < 2e-16 ***
## EPL_NOHSDP     0.188922   0.005952  31.739  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08688 on 8943 degrees of freedom
## Multiple R-squared:  0.9067, Adjusted R-squared:  0.9066 
## F-statistic:  7244 on 12 and 8943 DF,  p-value: < 2.2e-16

Results: All of the coefficients are determined to be significant. The output suggests that the best model, according to AIC, includes all the features. Removing any single feature did not lead to a significant improvement in AIC. The model seems to be highly significant, and the included variables all have statistically significant coefficients. The model also explains a substantial portion of the variability in the dependent variable (RPL_THEMES), as indicated by the high R-squared value of 0.9067.

LASSO

[Lasso analysis](https://www.statisticshowto.com/lasso-regression/) is used for variable selection; it adds a penalty term to the traditional linear regression objective function. The penalty shrinks the coefficients of less important features toward, or exactly to, zero, filtering out noise and focusing on the variables that most affect the target.
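The shrink-to-zero behavior comes from soft thresholding, which can be sketched in base R. This illustrates the mechanism only, not glmnet's internals:

```r
# Soft thresholding: shrink each coefficient toward zero by lambda,
# zeroing out any whose magnitude falls below the penalty
soft_threshold <- function(beta, lambda) {
  sign(beta) * pmax(abs(beta) - lambda, 0)
}

betas  <- c(0.45, -0.02, 0.30, 0.005, -0.15)
shrunk <- soft_threshold(betas, lambda = 0.05)
shrunk  # the two smallest coefficients become exactly zero
```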

selected_columns <- c("RPL_THEMES", "income_scaled", "EPL_POV150", "EPL_UNINSUR", 
                       "EPL_AGE65", "EPL_AGE17", "EPL_DISABL", 
                       "EPL_SNGPNT", "EPL_LIMENG", "EPL_MINRTY", 
                       "EPL_UNEMP", "EPL_HBURD", "EPL_NOHSDP")

# Filter the data frame
trim_svi <- svi_econ %>%
  select(all_of(selected_columns))

# Prepare data
x <- model.matrix(RPL_THEMES ~ . - 1, data = trim_svi)
y <- svi_econ$RPL_THEMES

# Fit LASSO regression model
fit_lasso <- cv.glmnet(x, y, alpha = 1)

# Display coefficients of the LASSO model
coef_lasso <- coef(fit_lasso, s = fit_lasso$lambda.min)
print(coef_lasso)
## 13 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept)   -0.21752984
## income_scaled -0.03103186
## EPL_POV150     0.19490422
## EPL_UNINSUR    0.12096308
## EPL_AGE65      0.09326387
## EPL_AGE17      0.02160010
## EPL_DISABL     0.15030534
## EPL_SNGPNT     0.13961008
## EPL_LIMENG     0.18789132
## EPL_MINRTY     0.12494756
## EPL_UNEMP      0.09555821
## EPL_HBURD      0.19381535
## EPL_NOHSDP     0.19188624
# Fit Ridge regression model
#fit_ridge <- cv.glmnet(x, y, alpha = 0)

# Display coefficients of the Ridge model
#coef_ridge <- coef(fit_ridge, s = fit_ridge$lambda.min)
#print(coef_ridge)

#optimal_lambda <- fit_lasso$lambda.min
#coefficients_at_optimal_lambda <- coef(fit_lasso, s = optimal_lambda)
#print(coefficients_at_optimal_lambda)

Results: All of the coefficients turn out to be non-zero, though some are smaller than others. Under this analysis, coefficients are deemed meaningful if they are non-zero. The coefficients can be interpreted as follows:

income_scaled: The coefficient is -0.031, suggesting a one-unit increase in income_scaled (an additional $100,000 of median income) is associated with a decrease of approximately 0.031 units in the response variable, holding other variables constant.

EPL_POV150: The coefficient is 0.1949, suggesting a one-unit increase in EPL_POV150 is associated with an increase of approximately 0.1949 units in the response variable, holding other variables constant.

Out of all of the variables, it appears EPL_AGE65 (0.09326387), EPL_AGE17 (0.02160010), and EPL_UNEMP (0.09555821) all have the smallest impact on overall SVI, but are still significant in the model.

Therefore, it appears that LASSO and Stepwise both selected the full model as the best model. In the next steps, a comparison of the simpler linear regression model and the full regression model will be carried out.

This answers the SMART question regarding which variables are significant in modeling and predicting SVI.

Linear Regression

We begin the classification process by examining the relationships of these variables in a linear regression, to identify which of the two models from feature selection is best to use. This is done by examining model performance and characteristics.

Simple Model

The simple model represents the linear regression modeled by: RPL_THEMES ~ income_scaled + EPL_POV150 + EPL_HBURD + EPL_NOHSDP, as identified by the correlation matrix.

simple_model <- lm(RPL_THEMES ~ income_scaled + EPL_POV150 + EPL_HBURD + EPL_NOHSDP, data = svi_econ)
summary(simple_model)
## 
## Call:
## lm(formula = RPL_THEMES ~ income_scaled + EPL_POV150 + EPL_HBURD + 
##     EPL_NOHSDP, data = svi_econ)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58346 -0.06916  0.00315  0.07167  0.42826 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.154554   0.010208   15.14   <2e-16 ***
## income_scaled -0.070985   0.006151  -11.54   <2e-16 ***
## EPL_POV150     0.260760   0.007749   33.65   <2e-16 ***
## EPL_HBURD      0.221818   0.007789   28.48   <2e-16 ***
## EPL_NOHSDP     0.407627   0.005192   78.51   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1114 on 8951 degrees of freedom
## Multiple R-squared:  0.8466, Adjusted R-squared:  0.8465 
## F-statistic: 1.235e+04 on 4 and 8951 DF,  p-value: < 2.2e-16
full_model <- lm(RPL_THEMES ~ income_scaled + EPL_POV150 +EPL_UNINSUR +EPL_AGE65 +EPL_AGE17+ EPL_DISABL+ EPL_SNGPNT +EPL_LIMENG + EPL_MINRTY +EPL_UNEMP + EPL_HBURD + EPL_NOHSDP, data = svi_econ)

In this model, all of the variables are significant with low p-values. This model has an R-squared of 0.846, suggesting a moderate fit.

Full Model

The full model represents the model utilizing all of the variables.

summary(full_model)
## 
## Call:
## lm(formula = RPL_THEMES ~ income_scaled + EPL_POV150 + EPL_UNINSUR + 
##     EPL_AGE65 + EPL_AGE17 + EPL_DISABL + EPL_SNGPNT + EPL_LIMENG + 
##     EPL_MINRTY + EPL_UNEMP + EPL_HBURD + EPL_NOHSDP, data = svi_econ)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.42698 -0.05646 -0.00013  0.05753  0.29479 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.230164   0.010370 -22.195  < 2e-16 ***
## income_scaled -0.030575   0.005186  -5.896 3.87e-09 ***
## EPL_POV150     0.194878   0.006275  31.055  < 2e-16 ***
## EPL_UNINSUR    0.123601   0.004927  25.088  < 2e-16 ***
## EPL_AGE65      0.100585   0.004778  21.051  < 2e-16 ***
## EPL_AGE17      0.025390   0.004150   6.119 9.83e-10 ***
## EPL_DISABL     0.150926   0.004850  31.116  < 2e-16 ***
## EPL_SNGPNT     0.140333   0.004160  33.731  < 2e-16 ***
## EPL_LIMENG     0.188629   0.006387  29.532  < 2e-16 ***
## EPL_MINRTY     0.131280   0.008519  15.409  < 2e-16 ***
## EPL_UNEMP      0.096957   0.003838  25.263  < 2e-16 ***
## EPL_HBURD      0.195848   0.006321  30.982  < 2e-16 ***
## EPL_NOHSDP     0.188922   0.005952  31.739  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08688 on 8943 degrees of freedom
## Multiple R-squared:  0.9067, Adjusted R-squared:  0.9066 
## F-statistic:  7244 on 12 and 8943 DF,  p-value: < 2.2e-16

In this model, all of the variables are significant with low p-values. This model has an R-squared of 0.9067, suggesting a better fit than the simple model, with a good fit.

Assessing AIC
# Assessment for initial_model
#summary(simple_model)
#plot(simple_model)

# Assessment for initial_model2
#summary(full_model)
#plot(full_model)

# Calculate AIC for initial_model
aic_simple_model <- AIC(simple_model)

# Calculate AIC for initial_model2
aic_full_model <- AIC(full_model)

# Compare AIC values
#cat("AIC for simple_model:", aic_simple_model, "\n")
#cat("AIC for full_model:", aic_full_model, "\n")

Assessing the AIC values of both models, the AIC for the simple model is -13891.99 while the AIC for the full model is -18331.04. The full model, having the lower AIC, is therefore the better model.

R-Squared
# Calculate Adjusted R-squared for initial_model
adj_rsq_simple_model <- summary(simple_model)$adj.r.squared

# Calculate Adjusted R-squared for initial_model2
adj_rsq_full_model2 <- summary(full_model)$adj.r.squared

# Compare Adjusted R-squared values
cat("Adjusted R-squared for simple:", adj_rsq_simple_model, "\n")
cat("Adjusted R-squared for full:", adj_rsq_full_model2, "\n")

Adjusted R-squared for simple model: 0.8465219

Adjusted R-squared for full model: 0.9065885

The full model is more effective at explaining the variation in the SVI scores, and thus may have a higher predictive capability.
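The adjusted R-squared values reported above follow from the standard formula adj R² = 1 − (1 − R²)(n − 1)/(n − p − 1). Checking the simple model's number by hand, with n and p read off the summary output (8951 residual degrees of freedom with 4 predictors implies n = 8956):

```r
# Simple model: R^2 = 0.8466, n = 8956 observations, p = 4 predictors
n  <- 8956
p  <- 4
r2 <- 0.8466

adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)  # ~0.8465, matching the reported value
```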

VIFS
# Calculate VIF for initial_model
vif_simple_model <- car::vif(simple_model)

# Calculate VIF for initial_model2
vif_full_model <- car::vif(full_model)

# Compare VIF values
cat("VIF for simple_model:", vif_simple_model, "\n")
cat("VIF for full_model:", vif_full_model, "\n")

Simple Model VIF: 4.30, 3.51, 3.64, 1.98

Full Model VIF: 5.02, 3.78, 2.14, 2.23, 1.80, 1.77, 1.71, 2.73, 3.34, 1.25, 3.95, 4.23

Both the simple and full models have VIF values under 10, and only one value (income_scaled, at 5.02) in the full model is slightly above 5. This is still within the acceptable range, so it is appropriate to use the full model.

Predictive Modeling
Quartile Model

As a result of the feature selection process, the full model was determined to be the appropriate and best-fitting model. Therefore, the variables from the full model will be used to examine whether it is possible to classify census tracts into levels of SVI risk.

From the CDC website, the recommended levels of risk to model are:

“In the CDC/ATSDR SVI Interactive Map, we classify data using quartiles (0 to .2500, .2501 to .5000, .5001 to .7500, .7501 to 1.0) and indicate that the classification goes from least vulnerable to most vulnerable.”

The initial data processing required to classify the census tracts is to assign each tract a level of low, medium-low, medium-high, and high based on the quartile breaks. Below is a plot of the census tracts broken into quartiles.

# Create quartile risk levels (include.lowest keeps tracts scoring exactly 0)
trim_svi$risk <- cut(trim_svi$RPL_THEMES, breaks = c(0, 0.25, 0.5, 0.75, 1.0), labels = c("low", "lowmed", "medhigh", "high"), include.lowest = TRUE)

# Plot the data with quartile labels
library(ggplot2)

ggplot(trim_svi, aes(x = risk, fill = risk)) +
  geom_bar() +
  labs(title = "Bar Plot of SVI Quartiles", x = "SVI Risk", y = "Count") +
  scale_fill_manual(values = c("low" = "#1a9641", "lowmed" = "#a6d96a", "medhigh" = "#fdae61", "high" = "#d7191c"))+
  scale_x_discrete(labels = c("low" = "Low", "lowmed" = "Low-Medium", "medhigh" = "Medium-High", "high" = "High"))+
  theme_minimal() +
  theme(legend.position = "none", text = element_text(family = "Avenir"))

From this plot, it is seen that in California there are more medium-high and high risk census tracts than low and medium-low risk tracts. This imbalance across target classes, with a bias toward high-risk tracts, may cause issues when classifying the data.

Classification Trees

Classification trees with a maximum depth of 4 were used to classify the census tracts into the quartile rankings. While this may not be the best-fitting tree size, anything beyond 4 levels becomes difficult to interpret, and the aim of the project is to create an easily interpretable figure for first responders and government officials. Interpretability is a strength of decision trees, though they may not be the most effective or comprehensive algorithm.

set.seed(1)
svi_econ_tree <- rpart(risk ~ income_scaled + EPL_POV150 + EPL_UNINSUR + EPL_AGE65 + EPL_AGE17 + EPL_DISABL + EPL_SNGPNT + EPL_LIMENG + EPL_MINRTY + EPL_UNEMP + EPL_HBURD + EPL_NOHSDP , data=trim_svi, method="class", control = list(maxdepth = 4) )

#summary(svi_econ_tree)
#plot(svi_econ_tree, uniform=TRUE, main="Classification Tree for svi_econ_tree")
#text(svi_econ_tree, use.n=TRUE, all=TRUE, cex=.8)

This figure depicts the decision tree used to classify census tracts into the rankings. From this tree, it is possible to see which variables are used to determine the classification of a tract and at what value. These variables are: EPL_POV150 (population below 150% of the poverty line), EPL_NOHSDP (no high school diploma), income_scaled (median income), and EPL_HBURD (housing cost burdened).

fancyRpartPlot(svi_econ_tree)

Below is a plot of the cross-validation relative error of the tree. Relative error is a metric used to assess predictive performance, measuring the difference between predicted and actual values relative to the actual values. The x-axis labeled “cp” is the complexity parameter: larger values of “cp” prune more aggressively, producing a smaller and simpler tree, which trades off against predictive accuracy. The second x-axis is the size of the tree, i.e., how many splits it contains.

This is an important graph to analyze: as a decision tree grows, it can overfit, capturing noise instead of real relationships. The goal is to find a “cp” that yields a well-balanced tree with good predictive performance on both the training and validation data. This tree’s relative error ranges from 0.4 to 1, implying that even the largest tree still has a relative error of about 40%.

#printcp(svi_econ_tree) 
plotcp(svi_econ_tree) 
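If pruning were desired, one conventional approach is to cut the tree at the cp that minimizes the cross-validated error shown in this plot. A hedged sketch on the built-in iris data (not the SVI tree above):

```r
library(rpart)

# Grow a tree, then prune at the cp with the lowest cross-validated error
tree    <- rpart(Species ~ ., data = iris, method = "class")
best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(tree, cp = best_cp)
```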

Confusion Matrix: Below is a confusion matrix using the above classification tree. From initial observations, it is clear that this model is not accurately classifying the census tracts into quartile risk classes.

# Generate predictions on the training data
train_predictions <- predict(svi_econ_tree, trim_svi, type = "class")

# Create a confusion matrix
conf_matrix_train <- table(train_predictions, trim_svi$risk)


xkabledply(conf_matrix_train, "confusion matrix")
confusion matrix (rows = predicted class, columns = actual class)

             low   lowmed   medhigh   high
low         1004      377        38      3
lowmed       403     1104       503     23
medhigh       46      420      1153    393
high           2       32       537   2918

From the confusion matrix, it is possible to gain information about the accuracies of this classification model. For this analysis, the algorithm is classifying a categorical variable into 4 levels. Therefore, the table below depicts the overall accuracy, and then the precision and recall by class.

# Convert to confusion matrix object
conf_matrix <- as.table(conf_matrix_train)

# Print the confusion matrix
# Calculate metrics
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
precision <- diag(conf_matrix) / rowSums(conf_matrix)
recall <- diag(conf_matrix) / colSums(conf_matrix)
f1_score <- 2 * (precision * recall) / (precision + recall)

# Print metrics
cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.6899285
cat("Precision (by class):", precision, "\n")
## Precision (by class): 0.7060478 0.5430398 0.5730616 0.8363428
cat("Recall (by class):", recall, "\n")
## Recall (by class): 0.6900344 0.571133 0.5168086 0.8744381
#cat("F1 Score (by class):", f1_score, "\n")

As seen in the results, the overall accuracy is moderate at 68.9%. The decision tree is best at classifying high-risk tracts, likely because the data is imbalanced, with more census tracts falling in the high-risk class. The model also struggles to separate adjacent classes, such as low versus low-medium and medium-high versus high, where the differences in scores and variable values are small.
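One hedged way to probe this imbalance explanation is to oversample the smaller classes to the size of the largest before refitting the tree. The sketch below uses base R on a toy label vector; the class names mirror the project's risk levels, but the data is made up:

```r
set.seed(1)
# Toy imbalanced labels standing in for the four risk classes
df <- data.frame(
  risk = factor(sample(c("low", "lowmed", "medhigh", "high"), size = 1000,
                       replace = TRUE, prob = c(0.15, 0.20, 0.25, 0.40))),
  x = rnorm(1000)
)

# Oversample each class (with replacement) up to the size of the largest class
n_max <- max(table(df$risk))
balanced <- do.call(rbind, lapply(split(df, df$risk), function(grp) {
  grp[sample(nrow(grp), n_max, replace = TRUE), ]
}))

table(balanced$risk)  # every class now has n_max rows
```

Refitting rpart on the balanced frame would show whether the low and medium classes suffer mainly from underrepresentation.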

Below is a table of the evaluation statistics for the model.

loadPkg("rpart")
loadPkg("caret")

# create an empty dataframe to store the results from confusion matrices
confusionMatrixResultDf = data.frame(
  Depth = numeric(0), Accuracy = numeric(0), Sensitivity = numeric(0),
  Specificity = numeric(0), Pos.Pred.Value = numeric(0), Neg.Pred.Value = numeric(0),
  Precision = numeric(0), Recall = numeric(0), F1 = numeric(0),
  Prevalence = numeric(0), Detection.Rate = numeric(0),
  Detection.Prevalence = numeric(0), Balanced.Accuracy = numeric(0),
  row.names = NULL
)

for (deep in 2:6) {
   kfit  <- rpart(risk ~ income_scaled + EPL_POV150 + EPL_UNINSUR + EPL_AGE65 + EPL_AGE17 + EPL_DISABL + EPL_SNGPNT + EPL_LIMENG + EPL_MINRTY + EPL_UNEMP + EPL_HBURD + EPL_NOHSDP , data=trim_svi, method="class", control = list(maxdepth = deep) )
  # 
  cm = confusionMatrix( predict(kfit, type = "class"), reference = trim_svi[, "risk"] ) # from caret library
  # 
  cmaccu = cm$overall['Accuracy']
  # print( paste("Total Accuracy = ", cmaccu ) )
  # 
  cmt = data.frame(Depth=deep, Accuracy = cmaccu, row.names = NULL ) # initialize a row of the metrics 
  cmt = cbind( cmt, data.frame( t(cm$byClass) ) ) # the dataframe of the transpose, with k valued added in front
  confusionMatrixResultDf = rbind(confusionMatrixResultDf, cmt)
  # print("Other metrics : ")
}

unloadPkg("caret")

xkabledply(confusionMatrixResultDf, title="SVI econ Classification Trees summary with varying MaxDepth")
SVI econ Classification Trees summary with varying MaxDepth
(rows = metrics; columns = per-class values)

Depth 2 (overall accuracy 0.6175):
Metric                  low      lowmed   medhigh  high
Sensitivity             0.8570   0.0000   0.7571   0.7773
Specificity             0.8400   1.0000   0.7206   0.9382
Pos Pred Value          0.5096   NaN      0.4734   0.8820
Neg Pred Value          0.9680   0.7842   0.8994   0.8765
Precision               0.5096   NA       0.4734   0.8820
Recall                  0.8570   0.0000   0.7571   0.7773
F1                      0.6392   NA       0.5825   0.8264
Prevalence              0.1625   0.2158   0.2491   0.3726
Detection Rate          0.1392   0.0000   0.1886   0.2896
Detection Prevalence    0.2732   0.0000   0.3984   0.3284
Balanced Accuracy       0.8485   0.5000   0.7388   0.8578

Depth 3 (overall accuracy 0.6663):
Metric                  low      lowmed   medhigh  high
Sensitivity             0.4584   0.6958   0.6100   0.7773
Specificity             0.9809   0.8149   0.8217   0.9382
Pos Pred Value          0.8235   0.5085   0.5316   0.8820
Neg Pred Value          0.9033   0.9068   0.8640   0.8765
Precision               0.8235   0.5085   0.5316   0.8820
Recall                  0.4584   0.6958   0.6100   0.7773
F1                      0.5890   0.5876   0.5681   0.8264
Prevalence              0.1625   0.2158   0.2491   0.3726
Detection Rate          0.0745   0.1502   0.1520   0.2896
Detection Prevalence    0.0904   0.2953   0.2858   0.3284
Balanced Accuracy       0.7197   0.7554   0.7159   0.8578

Depth 4 (overall accuracy 0.6899; depths 5 and 6 produce the same tree, so their rows are identical):
Metric                  low      lowmed   medhigh  high
Sensitivity             0.6900   0.5711   0.5168   0.8744
Specificity             0.9443   0.8677   0.8723   0.8984
Pos Pred Value          0.7060   0.5430   0.5731   0.8363
Neg Pred Value          0.9401   0.8803   0.8448   0.9234
Precision               0.7060   0.5430   0.5731   0.8363
Recall                  0.6900   0.5711   0.5168   0.8744
F1                      0.6979   0.5567   0.5435   0.8550
Prevalence              0.1625   0.2158   0.2491   0.3726
Detection Rate          0.1121   0.1233   0.1287   0.3258
Detection Prevalence    0.1588   0.2270   0.2247   0.3896
Balanced Accuracy       0.8172   0.7194   0.6945   0.8864
Binary Model

Instead of classifying into four levels of risk, splitting the census tracts into low versus high risk could potentially increase accuracy. The final model provided to emergency responders or government officials would then offer a high-level overview of which variables and values cause a census tract to be classified as high or low risk.

While the data is still imbalanced, the model might have a better accuracy by telling apart this coarser distinction rather than the small differences in values.

# Create binary classification (low vs high) based on the provided breaks
trim_svi$risk_binary <- cut(trim_svi$RPL_THEMES, breaks = c(0, 0.5, 1), labels = c("low", "high"))

ggplot(trim_svi, aes(x = risk_binary, fill = risk_binary)) +
  geom_bar() +
  labs(title = "Bar Plot of SVI Risk (Binary)", x = "SVI Risk", y = "Count") +
  scale_fill_manual(values = c("low" = "#1a9641", "high" = "#d7191c")) +
  scale_x_discrete(labels = c("low" = "Low (0-0.50)", "high" = "High (0.50-1)")) +
  theme_minimal() +
  theme(legend.position = "none", text = element_text(family = "Avenir"))

# Decision Tree Modeling
set.seed(1)
svi_econ_tree <- rpart(risk_binary ~ income_scaled + EPL_POV150 + EPL_UNINSUR + EPL_AGE65 + EPL_AGE17 + EPL_DISABL + EPL_SNGPNT + EPL_LIMENG + EPL_MINRTY + EPL_UNEMP + EPL_HBURD + EPL_NOHSDP, data = trim_svi, method = "class", control = list(maxdepth = 4))
printcp(svi_econ_tree)  # display the results 
## 
## Classification tree:
## rpart(formula = risk_binary ~ income_scaled + EPL_POV150 + EPL_UNINSUR + 
##     EPL_AGE65 + EPL_AGE17 + EPL_DISABL + EPL_SNGPNT + EPL_LIMENG + 
##     EPL_MINRTY + EPL_UNEMP + EPL_HBURD + EPL_NOHSDP, data = trim_svi, 
##     method = "class", control = list(maxdepth = 4))
## 
## Variables actually used in tree construction:
## [1] EPL_NOHSDP    EPL_POV150    income_scaled
## 
## Root node error: 3388/8956 = 0.37829
## 
## n= 8956 
## 
##         CP nsplit rel error  xerror      xstd
## 1 0.598288      0   1.00000 1.00000 0.0135463
## 2 0.033353      1   0.40171 0.41765 0.0101881
## 3 0.028335      3   0.33501 0.37131 0.0097057
## 4 0.022432      4   0.30667 0.34002 0.0093516
## 5 0.010000      5   0.28424 0.30579 0.0089339

Below is a confusion matrix of the values.

# Confusion Matrix
train_predictions <- predict(svi_econ_tree, trim_svi, type = "class")
conf_matrix_train <- table(train_predictions, trim_svi$risk_binary)

# Display the confusion matrix
xkabledply(conf_matrix_train, "confusion matrix")
confusion matrix (rows = predicted class, columns = actual class)

         low   high
low     2920    495
high     468   5073
# Convert to confusion matrix object
conf_matrix <- as.table(conf_matrix_train)

From the confusion matrix, the overall accuracy, precision for each class, and recall for each class are shown below.

# True positives, false positives, false negatives, true negatives
# (rows of conf_matrix_train are predictions, columns are actual classes)
tp <- conf_matrix_train["high", "high"]
fp <- conf_matrix_train["high", "low"]   # predicted high, actually low
fn <- conf_matrix_train["low", "high"]   # predicted low, actually high
tn <- conf_matrix_train["low", "low"]

# Accuracy
accuracy <- (tp + tn) / sum(conf_matrix_train)
cat("Overall Accuracy:", accuracy, "\n")

# Precision for "high"
precision_high <- tp / (tp + fp)
cat("Precision for 'high':", precision_high, "\n")

# Recall for "high"
recall_high <- tp / (tp + fn)
cat("Recall for 'high':", recall_high, "\n")

# Precision for "low"
precision_low <- tn / (tn + fn)
cat("Precision for 'low':", precision_low, "\n")

# Recall for "low"
recall_low <- tn / (tn + fp)
cat("Recall for 'low':", recall_low, "\n")

From this analysis, the binary classification model has an overall accuracy of 89.2%, higher than the 68.9% of the previous four-class model. The individual precision and recall values for this model are below:

  • Precision for ‘high’: 0.9155

  • Recall for ‘high’: 0.9111

  • Precision for ‘low’: 0.8551

  • Recall for ‘low’: 0.8619

These per-class measures are higher overall than in the four-class model. This algorithm does a better job of accurately predicting census tract risk and avoids false positives and false negatives more effectively. The model remains better at classifying high risk, likely still a consequence of the imbalanced data.

Using this binary classification, a classification tree can be provided to identify low versus high risk tracts, with clearly identifiable variables and values that determine the classification decisions. This deliverable, or an example of one, can be seen below.

set.seed(1)
svi_econ_tree <- rpart(risk_binary ~ income_scaled + EPL_POV150 + EPL_UNINSUR + EPL_AGE65 + EPL_AGE17 + EPL_DISABL + EPL_SNGPNT + EPL_LIMENG + EPL_MINRTY + EPL_UNEMP + EPL_HBURD + EPL_NOHSDP , data=trim_svi, method="class", control = list(maxdepth = 4) )

fancyRpartPlot(svi_econ_tree)

Conclusion: From this decision tree, it can be suggested that the variables planners and officials could examine to determine high versus low risk are EPL_POV150 (population below 150% of the poverty threshold), EPL_NOHSDP (no high school diploma), and income_scaled (median income). This finding provides further insight into which variables are significant in modeling the SVI risk of census tracts, building off the correlations identified in project 1 and contributing to the other findings and models analyzed in this project. It can also be concluded that the full model is the more representative one and does a better job of classifying census tract risk. Additionally, the model classifies binary high versus low risk better than the four-level risk classes. This suggests that to examine risk at a finer scale, it would be necessary to introduce more data, apply resampling methods such as bootstrapping, or tune model parameters further to build a more accurate model.
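As a hedged illustration of the resampling suggested above, a simple k-fold cross-validation loop over rpart fits estimates out-of-sample accuracy; it is shown here on the built-in iris data rather than trim_svi:

```r
library(rpart)

set.seed(1)
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(iris)))  # random fold assignment

acc <- sapply(1:k, function(i) {
  train <- iris[folds != i, ]
  test  <- iris[folds == i, ]
  fit   <- rpart(Species ~ ., data = train, method = "class")
  pred  <- predict(fit, test, type = "class")
  mean(pred == test$Species)  # held-out accuracy for this fold
})

cat("CV accuracy by fold:", round(acc, 3), "\n")
cat("Mean CV accuracy:", round(mean(acc), 3), "\n")
```

Replacing iris with trim_svi and Species with risk_binary would give a less optimistic accuracy estimate than the training-data confusion matrices reported above.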

COVID-19

Covid Data EDA and Modeling

Data Cleaning

Data cleaning is an essential step in data analysis to ensure accuracy and reliability. The given R code snippets demonstrate key data cleaning procedures such as column subsetting and renaming, missing values removal, formatting, and row subsetting.

Column Subsetting and Renaming

This process involves selecting specific columns from a dataset and renaming them for clarity and consistency. In the provided code, the subset function selects relevant columns from the SVI_Data dataset, creating a new dataframe SVI. Further, the 'STCNTY' column was renamed to 'county_fips', and the 'confirmed' column in us_cases was renamed to 'confirmed_cases' for clarity.

SVI <- subset(SVI_Data, select = c(STATE, STCNTY, COUNTY, FIPS, EP_POV150, EP_LIMENG, EP_AGE65, EP_MINRTY, EP_UNEMP, EP_NOHSDP, EP_AGE17, EP_DISABL, EP_SNGPNT, EP_MUNIT, EP_MOBILE, EP_CROWD, EP_NOVEH, EP_GROUPQ, RPL_THEMES))

# Renaming columns
colnames(SVI)[colnames(SVI) == 'STCNTY'] <- 'county_fips'

us_cases <- rename(us_cases, confirmed_cases = confirmed)

Missing Values Removal

Dealing with missing or invalid data is crucial. The code filters out rows in the SVI dataframe that contain the placeholder value -999, removing erroneous entries. Additionally, the na.omit function removes rows with any missing values (NA). The us_cases and us_deaths datasets are also checked for rows with missing data in key columns.

# Filter out rows with any value equal to -999
SVI <- SVI[rowSums(SVI == -999, na.rm = TRUE) == 0, ]

# Drop rows with missing values
SVI <- na.omit(SVI)

us_cases[is.na(us_cases$confirmed_cases), ]
##  [1] X               county_fips     county_name     state_name     
##  [5] state_fips      date            confirmed_cases lat            
##  [9] long            geometry       
## <0 rows> (or 0-length row.names)
us_deaths[is.na(us_deaths$deaths), ]
##  [1] X           county_fips county_name state_name  state_fips  date       
##  [7] deaths      lat         long        geometry   
## <0 rows> (or 0-length row.names)

Formatting

Data often requires conversion into a usable format. In this step, the ‘date’ columns in us_cases and us_deaths datasets are converted to POSIXct datetime objects using the as.POSIXct function. This conversion, dictated by a predefined format string (format_str), is essential for time-series analysis or operations that require date-time formatted data.

format_str <- '%Y-%m-%d'

# Convert the 'date' column to datetime and create a new 'datetime' column
us_cases$datetime <- as.POSIXct(us_cases$date, format = format_str)
us_deaths$datetime <- as.POSIXct(us_deaths$date, format = format_str)

Row Subsetting

This step involves selecting specific rows based on certain conditions. For the SVI dataset, rows pertaining to the state of California are extracted into CA_SVI. In us_cases and us_deaths, rows labeled as ‘Statewide Unallocated’ are excluded to focus on specific counties. Additionally, subsets of the us_cases and us_deaths datasets are created for the state of California (‘CA’), allowing for a more focused analysis on this region.

CA_SVI <- subset(SVI, STATE == "California")
# Drop rows that do not refer to a specific county ('Statewide Unallocated')
us_cases <- us_cases[us_cases$county_name != 'Statewide Unallocated', ]
us_deaths <- us_deaths[us_deaths$county_name != 'Statewide Unallocated', ]


us_cases_ca <- subset(us_cases, state_name == "CA")
us_deaths_ca <- subset(us_deaths, state_name == "CA")
EDA

The EDA techniques below focus on creating maps and timelines for COVID-19 in California.

Look at the Data

Before we proceed to the EDA, let's look at the data.

The COVID-19 confirmed cases dataset:

xkabledplyhead(us_cases_ca)
Head (rows = dataframe index; the geometry column is empty for these rows)

index  X      county_fips  county_name                 state_name  state_fips  date        confirmed_cases  lat  long  datetime
36097  36096  6000         Grand Princess Cruise Ship  CA          6           2020-01-22  0                NA   NA    2020-01-22
36098  36097  6000         Grand Princess Cruise Ship  CA          6           2020-01-23  0                NA   NA    2020-01-23
36099  36098  6000         Grand Princess Cruise Ship  CA          6           2020-01-24  0                NA   NA    2020-01-24
36100  36099  6000         Grand Princess Cruise Ship  CA          6           2020-01-25  0                NA   NA    2020-01-25
36101  36100  6000         Grand Princess Cruise Ship  CA          6           2020-01-26  0                NA   NA    2020-01-26

The COVID-19 deaths dataset:

xkabledplyhead(us_deaths_ca)
Head (rows = dataframe index; the geometry column is empty for these rows)

index  X      county_fips  county_name                 state_name  state_fips  date        deaths  lat  long  datetime
36097  36096  6000         Grand Princess Cruise Ship  CA          6           2020-01-22  0       NA   NA    2020-01-22
36098  36097  6000         Grand Princess Cruise Ship  CA          6           2020-01-23  0       NA   NA    2020-01-23
36099  36098  6000         Grand Princess Cruise Ship  CA          6           2020-01-24  0       NA   NA    2020-01-24
36100  36099  6000         Grand Princess Cruise Ship  CA          6           2020-01-25  0       NA   NA    2020-01-25
36101  36100  6000         Grand Princess Cruise Ship  CA          6           2020-01-26  0       NA   NA    2020-01-26

Data Aggregation and Summarization:

We start with aggregating and summarizing confirmed cases and deaths by county. The data is then grouped by county_fips and county_name within the state of California (us_cases_ca and us_deaths_ca). Summations of confirmed cases and deaths are calculated using the summarise function. The arrange function sorts these summaries in descending order, highlighting areas with the highest impact. This step is crucial for understanding the distribution and impact of COVID-19 at the county level.
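The grouping and summation described above can be sketched with dplyr; the column names mirror the project's, but the frame here is a made-up stand-in for us_cases_ca:

```r
library(dplyr)

# Toy stand-in for the California cases data
toy_cases <- data.frame(
  county_fips = c(6037, 6037, 6073, 6073),
  county_name = c("Los Angeles", "Los Angeles", "San Diego", "San Diego"),
  confirmed   = c(100, 250, 40, 60)
)

# Group by county, sum confirmed cases, and sort descending
Summarized_cases <- toy_cases %>%
  group_by(county_fips, county_name) %>%
  summarise(confirmed = sum(confirmed), .groups = "drop") %>%
  arrange(desc(confirmed))

Summarized_cases
```

The same pipeline applied to us_deaths_ca, with deaths in place of confirmed, produces the Summarized_deaths frame used in the second map.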

Map #1

Map of CA colored by confirmed cases

For confirmed cases, a choropleth map of California (plot_usmap) is generated, where each county is colored based on the number of confirmed COVID-19 cases (Summarized_cases). The map is enhanced with a color gradient (white to dark blue), making it easy to identify areas with higher numbers of cases.

# Plot the choropleth map
plot_usmap(
  data = Summarized_cases,
  values = "confirmed",
  include = c("CA"),
  labels = TRUE
) + 
scale_fill_gradient(
  low = "white", high = "darkblue", name = "Confirmed Cases", label = scales::comma) + 
labs(title = "COVID-19 Confirmed Cases in California Counties") +
theme(legend.position = "right")

Observation:

The darkest shade of blue, which suggests the highest number of confirmed cases, appears concentrated in one specific area (Los Angeles). This suggests that there is a significant outbreak or a larger population in this county, which could potentially be one of the more urban areas with higher population density.

The gradient scale on the right indicates that the highest number of confirmed cases in a single county goes up to 6,000,000, which is a substantial figure. This suggests a high transmission rate or a prolonged presence of the virus in the most affected areas.

Most counties appear to have a much lower number of cases, as indicated by the light blue or white color, showing a significant disparity in case distribution across the state.

Map #2

Map of CA colored by deaths

For number of deaths, a choropleth map of California (plot_usmap) is generated, where each county is colored based on the number of COVID-19 deaths (Summarized_deaths). The map is enhanced with a color gradient (white to dark blue), making it easy to identify areas with higher numbers of deaths.

# Plot the choropleth map
plot_usmap(
  data = Summarized_deaths,
  values = "deaths",
  include = c("CA"),
  labels = TRUE
) + 
scale_fill_gradient(
  low = "white", high = "darkblue", name = "Number of Deaths", label = scales::comma) + 
labs(title = "COVID-19 Deaths in California Counties") +
theme(legend.position = "right")

Observation:

The second map represents the number of deaths attributed to COVID-19. Again, the same county (LA) that had the highest number of confirmed cases also has the highest number of deaths, which is consistent with expectations.

The scale for deaths is significantly lower than that for confirmed cases, topping out at 250,000 deaths in the most affected county. This disparity between the number of cases and deaths may be due to various factors, including the healthcare system’s capacity, the demographic profile of the affected population, and the measures taken to treat and prevent the spread of the virus.

Similar to the confirmed cases, the rest of the counties show fewer deaths, indicating a lesser impact or effective management and response to the pandemic in these regions.

Timeline for Confirmed cases

Here we perform data aggregation and summarization tasks on two separate datasets: one containing COVID-19 data (confirmed cases) and the other containing social vulnerability indicators (referred to as SVI).

For the COVID-19 data, the code groups the data by county FIPS code, county name, and state name to organize the cases geographically. It then calculates the total number of confirmed COVID-19 cases for each county and sorts these totals in descending order, so that the counties with the most cases appear first.

For the SVI data, the code similarly groups the data by geographical identifiers but goes on to calculate the average for a range of vulnerability indicators within each county. These indicators include measures of poverty, linguistic isolation, the proportion of the elderly population, minority status, unemployment rates, education levels, and others that together create a profile of each county’s social vulnerability.
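The per-county averaging of the vulnerability indicators can be sketched the same way; across() applies mean to every EP_ column at once (again on a made-up frame, with just two illustrative indicator columns):

```r
library(dplyr)

# Toy stand-in with two of the project's indicator columns
toy_svi <- data.frame(
  county_fips = c(6037, 6037, 6073),
  EP_POV150   = c(20, 28, 18),
  EP_UNEMP    = c(6, 8, 5)
)

# County-level means of every indicator column starting with "EP_"
svi_means <- toy_svi %>%
  group_by(county_fips) %>%
  summarise(across(starts_with("EP_"), mean), .groups = "drop")

svi_means
```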

head(svi_all_counties_cases)
##   county_fips         COUNTY      STATE EP_POV150 EP_LIMENG EP_AGE65 EP_MINRTY
## 1        6037    Los Angeles California  23.87737 12.828138 13.89510  72.85186
## 2        6065      Riverside California  23.09419  8.239341 16.69419  63.61318
## 3        6059         Orange California  16.28873  8.660458 15.87712  57.10507
## 4        6073      San Diego California  18.12154  6.174760 14.85514  53.03333
## 5        6071 San Bernardino California  25.56299  7.435714 12.62857  69.68442
## 6        6001        Alameda California  15.24695  7.488064 14.21671  67.70398
##   EP_UNEMP EP_NOHSDP EP_AGE17 EP_DISABL EP_SNGPNT  EP_MUNIT EP_MOBILE  EP_CROWD
## 1 6.473968  20.79826 21.18955 10.177733  6.566640 24.788340  1.696275 12.440850
## 2 7.513178  18.21647 23.74128 12.406589  6.296705  7.409884  8.743023  7.675194
## 3 4.943954  13.27876 21.45654  9.089052  5.101307 17.061765  2.833497  9.299020
## 4 6.215501  11.81687 20.85789 10.109602  5.670508 20.062963  3.522908  7.106996
## 5 7.477273  19.43745 26.06212 11.952597  8.187013  9.546537  5.960606  9.195238
## 6 4.830769  11.06711 19.63873  9.510345  5.194430 22.045093  1.136870  8.306101
##   EP_NOVEH EP_GROUPQ RPL_THEMES confirmed_cases
## 1 8.950891  1.643563  0.6599071         7627646
## 2 4.189535  1.295349  0.6282279         1314109
## 3 4.304248  1.171405  0.5076575         1139456
## 4 5.362414  1.894650  0.5263907         1112369
## 5 4.872727  1.179654  0.6663433          942421
## 6 9.815385  2.041379  0.5008411          466773

The table above shows the counties with the highest COVID-19 case totals and their mean SVI indicator values.

Create timeline of covid confirmed cases

The aggregated data created above is used to visualize the timeline of COVID-19 confirmed cases across CA counties.

# Which are the 10 most affected US counties?
ca_cases_top_10 <- head(us_cases_all_counties, 10)

# Preparing a list with the 10 most affected counties:
top_10_cases_fips_list <- ca_cases_top_10$county_fips

ca_cases_top_10_datetime <- subset(us_cases_ca, county_fips %in% top_10_cases_fips_list)

# Plot Timeline of cases
ca_cases_top_10_datetime$datetime <- as.Date(ca_cases_top_10_datetime$datetime)  # Make sure datetime is in Date format

f <- ggplot(ca_cases_top_10_datetime, aes(x = datetime, y = confirmed_cases, color = county_name)) +
  geom_line() +
  labs(x = 'Timeline', y = 'Confirmed Cases', title = 'Confirmed cases in CA counties') +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_color_discrete(name = 'County Name')+
  scale_x_date(date_breaks = "1 month", date_labels = "%b %Y")   # Format the x-axis with month and year


print(f)

Observation

Los Angeles County stands out with a pronounced exponential growth curve, indicating a rapid increase in confirmed cases, surpassing 150,000 by August 2020. Other counties such as San Diego, Orange, and Riverside also show a rise in cases, but none as dramatic as Los Angeles County. This graph emphasizes the wide variation in case numbers between counties, with some experiencing relatively moderate increases while one, in particular, sees a far more aggressive spread of the virus.

Timeline for Number of Deaths

Here we repeat the aggregation and summarization, this time pairing the COVID-19 deaths data with the vulnerability indicators (SVI).

For the COVID-19 data, the code groups the deaths by county FIPS code, county name, and state name, calculates the total number of deaths for each county, and sorts these totals in descending order so that the counties with the most deaths appear first.

For the SVI data, the code similarly groups by geographical identifiers and calculates the county-level average for each vulnerability indicator, covering poverty, linguistic isolation, the proportion of the elderly population, minority status, unemployment rates, education levels, and the other measures that together profile each county's social vulnerability.

head(svi_all_counties_deaths)
##   county_fips         COUNTY      STATE EP_POV150 EP_LIMENG EP_AGE65 EP_MINRTY
## 1        6037    Los Angeles California  23.87737 12.828138 13.89510  72.85186
## 2        6065      Riverside California  23.09419  8.239341 16.69419  63.61318
## 3        6073      San Diego California  18.12154  6.174760 14.85514  53.03333
## 4        6059         Orange California  16.28873  8.660458 15.87712  57.10507
## 5        6071 San Bernardino California  25.56299  7.435714 12.62857  69.68442
## 6        6085    Santa Clara California  11.68260  8.835294 13.83750  68.57598
##   EP_UNEMP EP_NOHSDP EP_AGE17 EP_DISABL EP_SNGPNT  EP_MUNIT EP_MOBILE  EP_CROWD
## 1 6.473968  20.79826 21.18955 10.177733  6.566640 24.788340  1.696275 12.440850
## 2 7.513178  18.21647 23.74128 12.406589  6.296705  7.409884  8.743023  7.675194
## 3 6.215501  11.81687 20.85789 10.109602  5.670508 20.062963  3.522908  7.106996
## 4 4.943954  13.27876 21.45654  9.089052  5.101307 17.061765  2.833497  9.299020
## 5 7.477273  19.43745 26.06212 11.952597  8.187013  9.546537  5.960606  9.195238
## 6 4.212745  11.63505 21.92353  8.268382  4.490441 19.685049  2.934559  8.681127
##   EP_NOVEH EP_GROUPQ RPL_THEMES deaths
## 1 8.950891  1.643563  0.6599071 262786
## 2 4.189535  1.295349  0.6282279  36583
## 3 5.362414  1.894650  0.5263907  30093
## 4 4.304248  1.171405  0.5076575  22931
## 5 4.872727  1.179654  0.6663433  21268
## 6 5.313235  1.862500  0.4473218  15714

The table above shows the counties with the highest COVID-19 death totals and their mean SVI indicator values.

Create timeline of covid deaths

The aggregated data created above is used to visualize the timeline of COVID-19 deaths across CA counties.

# Which are the 10 most affected US counties?
ca_deaths_top_10 <- head(us_deaths_all_counties, 10)

# Preparing a list with the 10 most affected counties:
top_10_deaths_fips_list <- ca_deaths_top_10$county_fips

ca_deaths_top_10_datetime <- subset(us_deaths_ca, county_fips %in% top_10_deaths_fips_list)

# Plot timeline of deaths
ca_deaths_top_10_datetime$datetime <- as.Date(ca_deaths_top_10_datetime$datetime)  # Make sure datetime is in Date format

f <- ggplot(ca_deaths_top_10_datetime, aes(x = datetime, y = deaths, color = county_name)) +
  geom_line() +
  labs(x = 'Timeline', y = 'Deaths', title = 'COVID-19 deaths in CA counties') +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_color_discrete(name = 'County Name')+
  scale_x_date(date_breaks = "1 month", date_labels = "%b %Y")   # Format the x-axis with month and year

print(f)

Observation

The graph shows a timeline of COVID-19 cases, where the number of deaths in Los Angeles County appears to exceed 4,000. In contrast, other counties represented on the graph, such as Alameda, Orange, Riverside, and others, show a much flatter curve with far fewer deaths over the same period. This visualization starkly illustrates the disparate impact of the pandemic across different regions within the state, with Los Angeles County being the most severely affected in terms of mortality. Both graphs provide a clear visual comparison of the pandemic’s impact at the county level and underline the importance of localized data in understanding and responding to the spread of COVID-19.

Correlation Matrix
Correlation matrix #1

Correlation matrix between confirmed cases and SVI variables

# Plot the correlation matrix
# Select relevant columns
svi_all_counties_cases_corr <- svi_all_counties_cases[c('confirmed_cases', 'EP_POV150', 'EP_LIMENG', 'EP_AGE65', 'EP_MINRTY', 'EP_UNEMP', 'EP_NOHSDP', 'EP_AGE17', 'EP_DISABL', 'EP_SNGPNT', 'EP_MUNIT', 'EP_MOBILE', 'EP_CROWD', 'EP_NOVEH', 'EP_GROUPQ', 'RPL_THEMES')]

# Calculate correlation matrix
corr <- cor(svi_all_counties_cases_corr)
corrplot(corr, is.corr = TRUE, method = 'color', na.label='NA',col = colorRampPalette(c("white", "darkblue"))(100), addCoef.col = 'black', number.cex = 0.5,mar = c(0, 0, 2, 0), tl.cex = 0.5, tl.col = 'black', tl.srt = 45)

Observation:

The matrix correlates various social vulnerability indices (like poverty, linguistic isolation, age, minority status, unemployment, etc.) with confirmed COVID-19 cases.

There’s a strong positive correlation between confirmed cases and EP_POV150 (representing poverty), EP_MINRTY (minority status), and EP_NOHSDP (individuals without a high school diploma). These strong positive correlations suggest that counties with higher poverty rates, larger minority populations, and more individuals without a high school diploma tend to have more confirmed COVID-19 cases.

EP_LIMENG (limited English proficiency) and EP_AGE65 (population aged 65 and over) show strong negative correlations with confirmed cases, which might indicate that areas with more non-English speakers or senior citizens have fewer confirmed cases. However, this could also reflect underreporting or less testing in these communities.

The correlation with RPL_THEMES is also moderately strong, suggesting that higher overall vulnerability is associated with more confirmed cases.

Correlation matrix #2

Correlation matrix between deaths and SVI variables

# Plot the correlation matrix
# Select relevant columns
svi_all_counties_deaths_corr <- svi_all_counties_deaths[c('deaths', 'EP_POV150', 'EP_LIMENG', 'EP_AGE65', 'EP_MINRTY', 'EP_UNEMP', 'EP_NOHSDP', 'EP_AGE17', 'EP_DISABL', 'EP_SNGPNT', 'EP_MUNIT', 'EP_MOBILE', 'EP_CROWD', 'EP_NOVEH', 'EP_GROUPQ', 'RPL_THEMES')]

# Calculate correlation matrix
corr <- cor(svi_all_counties_deaths_corr)
corrplot(corr, is.corr = TRUE, method = 'color', na.label='NA',col = colorRampPalette(c("white", "darkblue"))(100), addCoef.col = 'black', number.cex = 0.5,mar = c(0, 0, 2, 0), tl.cex = 0.5, tl.col = 'black', tl.srt = 45)

Observation:

This matrix analyzes the same set of variables but correlates them with COVID-19 deaths instead of cases.

EP_POV150, EP_MINRTY, and EP_NOHSDP again show moderate to strong positive correlations with COVID-19 deaths, indicating that higher poverty, higher minority populations, and lower education levels are associated with higher death counts.

Similar to the confirmed cases, EP_LIMENG and EP_AGE65 show negative correlations with deaths, which could suggest lower mortality in communities with these characteristics, although the reasons for this would require further investigation.

Correlation matrix #3

Correlation matrix between SVI variables and confirmed cases for the ten most affected counties

# Select the rows in the SVI dataframe that contain the ten counties with most confirmed cases.
svi_top_10_counties_cases <- SVI[SVI$county_fips %in% top_10_cases_fips_list, ]

# Group by county, COUNTY, and STATE, and calculate the mean values for selected indicators.
svi_top_10_counties_cases <- svi_top_10_counties_cases %>%
  group_by(county_fips, COUNTY, STATE) %>%
  summarise(
    EP_POV150 = mean(EP_POV150),
    EP_LIMENG = mean(EP_LIMENG),
    EP_AGE65 = mean(EP_AGE65),
    EP_MINRTY = mean(EP_MINRTY),
    EP_UNEMP = mean(EP_UNEMP),
    EP_NOHSDP = mean(EP_NOHSDP),
    EP_AGE17 = mean(EP_AGE17),
    EP_DISABL = mean(EP_DISABL),
    EP_SNGPNT = mean(EP_SNGPNT),
    EP_MUNIT = mean(EP_MUNIT),
    EP_MOBILE = mean(EP_MOBILE),
    EP_CROWD = mean(EP_CROWD),
    EP_NOVEH = mean(EP_NOVEH),
    EP_GROUPQ = mean(EP_GROUPQ),
    RPL_THEMES = mean(RPL_THEMES)
  ) %>%
  ungroup()
## `summarise()` has grouped output by 'county_fips', 'COUNTY'. You can override
## using the `.groups` argument.
# Merge county SVI with confirmed cases of Covid-19
svi_top_10_counties_cases <- merge(svi_top_10_counties_cases, ca_cases_top_10, by = 'county_fips') %>%
  arrange(desc(confirmed_cases)) %>%
  select(-county_name, -state_name)

# Select columns for correlation analysis
svi_top_10_counties_cases_corr <- svi_top_10_counties_cases[c(
  'confirmed_cases', 'EP_POV150', 'EP_LIMENG', 'EP_AGE65', 'EP_MINRTY',
  'EP_UNEMP', 'EP_NOHSDP', 'EP_AGE17', 'EP_DISABL', 'EP_SNGPNT',
  'EP_MUNIT', 'EP_MOBILE', 'EP_CROWD', 'EP_NOVEH', 'EP_GROUPQ', 'RPL_THEMES'
)]

# Calculate correlation matrix
corr <- cor(svi_top_10_counties_cases_corr)

corrplot(corr, is.corr = TRUE, method = 'color', na.label = 'NA',
         col = colorRampPalette(c("white", "darkblue"))(100),
         addCoef.col = 'black', number.cex = 0.5, mar = c(0, 0, 2, 0),
         tl.cex = 0.5, tl.col = 'black', tl.srt = 45)

Observation:

There is a notable negative correlation between confirmed cases and EP_DISABL (persons with a disability), EP_SNGPNT (single-parent households), and EP_MOBILE (mobile home residency). These negative values suggest that as the proportion of disabled individuals, single-parent households, or mobile home residents increases, confirmed cases tend to decrease, possibly due to factors such as isolation or reduced social contact.

The variable EP_UNEMP (unemployment) has a strong positive correlation with EP_NOHSDP (no high school diploma) and EP_POV150 (poverty), which might indicate that socioeconomic factors are interconnected and potentially contribute to vulnerability to COVID-19.

EP_CROWD (crowded housing) shows a substantial positive correlation with confirmed cases, hinting that living conditions with limited space might contribute to the spread of the virus.

The variable RPL_THEMES, with moderate positive correlations across the board, might be an aggregate indicator of risk, showing that overall vulnerability is linked to the number of confirmed cases.

Correlation matrix #4

Correlation matrix between SVI variables and deaths for the ten most affected counties

# Select the rows in the SVI dataframe that contain the ten counties with the most deaths.
svi_top_10_counties_deaths <- SVI[SVI$county_fips %in% top_10_deaths_fips_list, ]

# Group by county_fips, COUNTY, and STATE, and calculate the mean values for selected indicators.
svi_top_10_counties_deaths <- svi_top_10_counties_deaths %>%
  group_by(county_fips, COUNTY, STATE) %>%
  summarise(
    EP_POV150 = mean(EP_POV150),
    EP_LIMENG = mean(EP_LIMENG),
    EP_AGE65 = mean(EP_AGE65),
    EP_MINRTY = mean(EP_MINRTY),
    EP_UNEMP = mean(EP_UNEMP),
    EP_NOHSDP = mean(EP_NOHSDP),
    EP_AGE17 = mean(EP_AGE17),
    EP_DISABL = mean(EP_DISABL),
    EP_SNGPNT = mean(EP_SNGPNT),
    EP_MUNIT = mean(EP_MUNIT),
    EP_MOBILE = mean(EP_MOBILE),
    EP_CROWD = mean(EP_CROWD),
    EP_NOVEH = mean(EP_NOVEH),
    EP_GROUPQ = mean(EP_GROUPQ),
    RPL_THEMES = mean(RPL_THEMES)
  ) %>%
  ungroup()
## `summarise()` has grouped output by 'county_fips', 'COUNTY'. You can override
## using the `.groups` argument.
# Merge county SVI with COVID-19 deaths
svi_top_10_counties_deaths <- merge(svi_top_10_counties_deaths, ca_deaths_top_10, by = 'county_fips') %>%
  arrange(desc(deaths)) %>%
  select(-county_name, -state_name)

# Select columns for correlation analysis
svi_top_10_counties_deaths_corr <- svi_top_10_counties_deaths[c(
  'deaths', 'EP_POV150', 'EP_LIMENG', 'EP_AGE65', 'EP_MINRTY',
  'EP_UNEMP', 'EP_NOHSDP', 'EP_AGE17', 'EP_DISABL', 'EP_SNGPNT',
  'EP_MUNIT', 'EP_MOBILE', 'EP_CROWD', 'EP_NOVEH', 'EP_GROUPQ', 'RPL_THEMES'
)]

# Calculate correlation matrix
corr <- cor(svi_top_10_counties_deaths_corr)

corrplot(corr, is.corr = TRUE, method = 'color', na.label = 'NA',
         col = colorRampPalette(c("white", "darkblue"))(100),
         addCoef.col = 'black', number.cex = 0.5, mar = c(0, 0, 2, 0),
         tl.cex = 0.5, tl.col = 'black', tl.srt = 45)

Observation:

EP_LIMENG (limited English proficiency) and EP_MINRTY (minority status) show moderate positive correlations with deaths, suggesting that these communities may experience higher mortality rates from COVID-19.

Interestingly, the negative correlations observed in the confirmed cases matrix are not as pronounced here. EP_DISABL, EP_SNGPNT, and EP_MOBILE do not exhibit strong negative correlations with deaths, which could imply that while these factors are associated with fewer cases, they do not necessarily predict a lower number of deaths.

The EP_CROWD variable also shows a positive correlation with deaths, which supports the idea that crowded living conditions may not only facilitate the spread of the virus but also contribute to higher mortality.

RPL_THEMES again shows a moderate positive correlation with deaths, consistent with the notion that overall social vulnerability is related to worse outcomes in the pandemic.

Conclusion from correlation matrix:

The matrices indicate that social vulnerability factors are significantly correlated with both the spread of COVID-19 and its mortality impact.

There are strong inter-correlations between several vulnerability indicators, suggesting that these factors are often co-occurring in the affected communities.

While both matrices show that higher vulnerability correlates with negative COVID-19 outcomes, the patterns are not identical for confirmed cases and deaths, indicating that different factors may play a role in infection rates versus mortality rates.

Correlation does not imply causation; these relationships should therefore be interpreted with caution.
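One way to support the caution above is to attach a significance test to any individual cell of the matrices, rather than judging coefficient magnitudes by eye. The sketch below is illustrative only and uses simulated data; in the actual analysis the two vectors would be columns such as svi_all_counties_deaths$deaths and svi_all_counties_deaths$EP_POV150.

```r
# Illustrative only: test whether a single correlation differs from zero.
# The simulated vectors stand in for SVI columns like EP_POV150 and deaths.
set.seed(1)
poverty <- runif(50, 5, 40)                        # simulated EP_POV150-style percentages
deaths  <- 200 + 15 * poverty + rnorm(50, 0, 80)   # simulated death counts

result <- cor.test(poverty, deaths, method = "pearson")
result$estimate   # Pearson correlation coefficient
result$p.value    # two-sided p-value for H0: true correlation = 0
```

A small p-value here only rules out a zero correlation; it says nothing about the direction of causation.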

Modeling
Multiple Linear Regression for confirmed cases #1

Initial MLR for confirmed cases

library(caret)
# Preparing the independent variable (X) and the dependent variable (y)
y <- svi_all_counties_cases$RPL_THEMES
X <- svi_all_counties_cases[, c('confirmed_cases', 'EP_POV150', 'EP_LIMENG', 'EP_AGE65',
                                'EP_MINRTY', 'EP_UNEMP', 'EP_NOHSDP', 'EP_AGE17',
                                'EP_DISABL', 'EP_SNGPNT', 'EP_MUNIT', 'EP_MOBILE',
                                'EP_CROWD', 'EP_NOVEH', 'EP_GROUPQ')]

# Splitting the data into train and test data
set.seed(42)  # For reproducibility
train_indices <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- X[train_indices, ]
X_test <- X[-train_indices, ]
y_train <- y[train_indices]
y_test <- y[-train_indices]

# Initiate the model
lm_model1 <- lm(y_train ~ ., data = cbind(y_train, X_train))

# Predictions on the test set
y_predict <- predict(lm_model1, newdata = X_test)
summary(lm_model1)
## 
## Call:
## lm(formula = y_train ~ ., data = cbind(y_train, X_train))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.121482 -0.028111 -0.001598  0.027159  0.099589 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)  
## (Intercept)     -9.728e-02  2.268e-01  -0.429   0.6709  
## confirmed_cases -6.660e-09  1.070e-08  -0.622   0.5381  
## EP_POV150        6.504e-03  3.441e-03   1.890   0.0678 .
## EP_LIMENG       -8.797e-03  8.352e-03  -1.053   0.3001  
## EP_AGE65         4.400e-03  5.197e-03   0.847   0.4035  
## EP_MINRTY        1.050e-03  1.782e-03   0.589   0.5599  
## EP_UNEMP         1.083e-02  1.006e-02   1.077   0.2897  
## EP_NOHSDP        9.497e-03  5.626e-03   1.688   0.1011  
## EP_AGE17         1.796e-03  7.887e-03   0.228   0.8213  
## EP_DISABL        3.163e-03  6.040e-03   0.524   0.6041  
## EP_SNGPNT        1.536e-02  8.527e-03   1.802   0.0810 .
## EP_MUNIT         7.900e-03  3.081e-03   2.564   0.0152 *
## EP_MOBILE        5.645e-03  3.180e-03   1.775   0.0854 .
## EP_CROWD         1.146e-04  8.008e-03   0.014   0.9887  
## EP_NOVEH        -4.033e-03  4.701e-03  -0.858   0.3973  
## EP_GROUPQ       -1.415e-03  3.753e-03  -0.377   0.7087  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05882 on 32 degrees of freedom
## Multiple R-squared:   0.86,  Adjusted R-squared:  0.7944 
## F-statistic: 13.11 on 15 and 32 DF,  p-value: 1.188e-09
# Score (R-squared)
print(summary(lm_model1)$r.squared)
## [1] 0.8600357
# Coefficients
confirmed_cases_all_counties_coef <- data.frame(Coef = coef(lm_model1)[-1], Features = names(coef(lm_model1)[-1]))
confirmed_cases_all_counties_coef <- confirmed_cases_all_counties_coef[order(-confirmed_cases_all_counties_coef$Coef), ]

The comprehensive model incorporates all variables under consideration, including confirmed COVID-19 cases. It yields a multiple R-squared of 0.86, meaning that approximately 86% of the variance in the overall Social Vulnerability Index (SVI) score, RPL_THEMES, is accounted for by the model. This is a high degree of explained variability, although in-sample fit alone does not guarantee out-of-sample predictive performance.

Upon examining the individual contributions of each predictor, we observe a spectrum of significance levels. Notably, the variable EP_MUNIT, representing the prevalence of multi-unit housing, emerges as a statistically significant predictor at the 0.05 level, indicative of a substantial correlation with the SVI score. Additionally, variables such as EP_POV150 (poverty indicator), EP_SNGPNT (single-parent households), and EP_MOBILE (mobile home residency) exhibit potential significance with p-values close to 0.1, suggesting their probable but less definitive influence on the SVI score.

Conversely, the coefficient pertaining to confirmed COVID-19 cases does not reach a level of statistical significance. This absence of a clear linear association within the context of the model implies that while confirmed cases are an integral aspect of the public health landscape, they do not necessarily predict the SVI score directly when other variables are considered. This insight underscores the multifaceted nature of social vulnerability and the intricate interplay of various determinants.
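The held-out predictions computed above (y_predict against y_test) are never actually scored in this chunk. A base-R helper like the one sketched below, assuming those objects are still in scope, would quantify out-of-sample fit:

```r
# Sketch (not part of the original chunk): score held-out predictions.
# RMSE measures the average prediction error; out-of-sample R-squared
# compares the model against simply predicting the mean test response.
test_metrics <- function(actual, predicted) {
  ss_res <- sum((actual - predicted)^2)
  ss_tot <- sum((actual - mean(actual))^2)
  c(RMSE = sqrt(mean((actual - predicted)^2)),
    R2   = 1 - ss_res / ss_tot)
}

# Usage with the objects created above:
# test_metrics(y_test, y_predict)
```

A test-set R-squared well below the in-sample 0.86 would signal overfitting.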

Feature Selection

A stepwise feature selection process based on the Akaike information criterion (AIC) is applied to refine the model. The goal is to identify a subset of variables that provides a parsimonious model without compromising explanatory power. The procedure sequentially removes the variables that contribute least to the model (e.g., EP_CROWD is removed first), leaving a more streamlined model focused on the predictors that explain the most variation in the SVI score.
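As background on the criterion driving step(): for a fitted lm, AIC equals 2k minus twice the log-likelihood, where k counts the estimated parameters. The example below uses the built-in mtcars data purely for illustration; note that step() reports extractAIC(), which omits an additive constant, so its printed values differ from AIC() while ranking models identically.

```r
# AIC = 2k - 2 * log-likelihood; smaller is better when comparing
# models fitted to the same response. Illustrated on built-in data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

k       <- attr(logLik(fit), "df")              # parameters incl. the error variance
by_hand <- 2 * k - 2 * as.numeric(logLik(fit))

AIC(fit)          # matches by_hand
extractAIC(fit)   # (edf, AIC up to a constant) -- the scale step() prints
```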

# Stepwise feature selection
lm_model2 <- step(lm_model1, direction = "both")
## Start:  AIC=-259.45
## y_train ~ confirmed_cases + EP_POV150 + EP_LIMENG + EP_AGE65 + 
##     EP_MINRTY + EP_UNEMP + EP_NOHSDP + EP_AGE17 + EP_DISABL + 
##     EP_SNGPNT + EP_MUNIT + EP_MOBILE + EP_CROWD + EP_NOVEH + 
##     EP_GROUPQ
## 
##                   Df Sum of Sq     RSS     AIC
## - EP_CROWD         1 0.0000007 0.11072 -261.45
## - EP_AGE17         1 0.0001795 0.11090 -261.38
## - EP_GROUPQ        1 0.0004916 0.11121 -261.24
## - EP_DISABL        1 0.0009488 0.11167 -261.04
## - EP_MINRTY        1 0.0012011 0.11192 -260.94
## - confirmed_cases  1 0.0013401 0.11206 -260.88
## - EP_AGE65         1 0.0024803 0.11320 -260.39
## - EP_NOVEH         1 0.0025468 0.11327 -260.36
## - EP_LIMENG        1 0.0038386 0.11456 -259.82
## - EP_UNEMP         1 0.0040099 0.11473 -259.75
## <none>                         0.11072 -259.45
## - EP_NOHSDP        1 0.0098593 0.12058 -257.36
## - EP_MOBILE        1 0.0109020 0.12162 -256.94
## - EP_SNGPNT        1 0.0112298 0.12195 -256.82
## - EP_POV150        1 0.0123617 0.12308 -256.37
## - EP_MUNIT         1 0.0227473 0.13347 -252.48
## 
## Step:  AIC=-261.45
## y_train ~ confirmed_cases + EP_POV150 + EP_LIMENG + EP_AGE65 + 
##     EP_MINRTY + EP_UNEMP + EP_NOHSDP + EP_AGE17 + EP_DISABL + 
##     EP_SNGPNT + EP_MUNIT + EP_MOBILE + EP_NOVEH + EP_GROUPQ
## 
##                   Df Sum of Sq     RSS     AIC
## - EP_AGE17         1 0.0001788 0.11090 -263.38
## - EP_GROUPQ        1 0.0005004 0.11122 -263.24
## - EP_DISABL        1 0.0009537 0.11168 -263.04
## - EP_MINRTY        1 0.0013562 0.11208 -262.87
## - confirmed_cases  1 0.0014274 0.11215 -262.84
## - EP_AGE65         1 0.0024822 0.11321 -262.39
## - EP_NOVEH         1 0.0025720 0.11330 -262.35
## - EP_LIMENG        1 0.0039575 0.11468 -261.77
## - EP_UNEMP         1 0.0040167 0.11474 -261.74
## <none>                         0.11072 -261.45
## + EP_CROWD         1 0.0000007 0.11072 -259.45
## - EP_NOHSDP        1 0.0098658 0.12059 -259.36
## - EP_MOBILE        1 0.0110205 0.12174 -258.90
## - EP_SNGPNT        1 0.0113076 0.12203 -258.79
## - EP_POV150        1 0.0130786 0.12380 -258.09
## - EP_MUNIT         1 0.0227910 0.13351 -254.47
## 
## Step:  AIC=-263.38
## y_train ~ confirmed_cases + EP_POV150 + EP_LIMENG + EP_AGE65 + 
##     EP_MINRTY + EP_UNEMP + EP_NOHSDP + EP_DISABL + EP_SNGPNT + 
##     EP_MUNIT + EP_MOBILE + EP_NOVEH + EP_GROUPQ
## 
##                   Df Sum of Sq     RSS     AIC
## - EP_GROUPQ        1 0.0009348 0.11184 -264.97
## - EP_DISABL        1 0.0009998 0.11190 -264.94
## - EP_MINRTY        1 0.0013309 0.11223 -264.80
## - confirmed_cases  1 0.0021979 0.11310 -264.43
## - EP_AGE65         1 0.0028648 0.11377 -264.15
## <none>                         0.11090 -263.38
## - EP_LIMENG        1 0.0051724 0.11607 -263.19
## - EP_UNEMP         1 0.0055187 0.11642 -263.04
## - EP_NOVEH         1 0.0055872 0.11649 -263.02
## + EP_AGE17         1 0.0001788 0.11072 -261.45
## + EP_CROWD         1 0.0000000 0.11090 -261.38
## - EP_SNGPNT        1 0.0111661 0.12207 -260.77
## - EP_MOBILE        1 0.0115235 0.12243 -260.63
## - EP_POV150        1 0.0142205 0.12512 -259.58
## - EP_NOHSDP        1 0.0167041 0.12761 -258.64
## - EP_MUNIT         1 0.0246999 0.13560 -255.72
## 
## Step:  AIC=-264.97
## y_train ~ confirmed_cases + EP_POV150 + EP_LIMENG + EP_AGE65 + 
##     EP_MINRTY + EP_UNEMP + EP_NOHSDP + EP_DISABL + EP_SNGPNT + 
##     EP_MUNIT + EP_MOBILE + EP_NOVEH
## 
##                   Df Sum of Sq     RSS     AIC
## - EP_DISABL        1 0.0010079 0.11284 -266.54
## - EP_MINRTY        1 0.0017544 0.11359 -266.23
## - confirmed_cases  1 0.0020433 0.11388 -266.10
## - EP_AGE65         1 0.0030932 0.11493 -265.66
## <none>                         0.11184 -264.97
## - EP_LIMENG        1 0.0048490 0.11668 -264.94
## - EP_UNEMP         1 0.0050493 0.11689 -264.85
## - EP_NOVEH         1 0.0055482 0.11738 -264.65
## + EP_GROUPQ        1 0.0009348 0.11090 -263.38
## + EP_AGE17         1 0.0006132 0.11122 -263.24
## + EP_CROWD         1 0.0000043 0.11183 -262.97
## - EP_MOBILE        1 0.0107699 0.12261 -262.56
## - EP_SNGPNT        1 0.0107766 0.12261 -262.56
## - EP_POV150        1 0.0155872 0.12742 -260.71
## - EP_NOHSDP        1 0.0158809 0.12772 -260.60
## - EP_MUNIT         1 0.0237719 0.13561 -257.72
## 
## Step:  AIC=-266.54
## y_train ~ confirmed_cases + EP_POV150 + EP_LIMENG + EP_AGE65 + 
##     EP_MINRTY + EP_UNEMP + EP_NOHSDP + EP_SNGPNT + EP_MUNIT + 
##     EP_MOBILE + EP_NOVEH
## 
##                   Df Sum of Sq     RSS     AIC
## - confirmed_cases  1 0.0017218 0.11457 -267.81
## - EP_MINRTY        1 0.0018261 0.11467 -267.77
## - EP_NOVEH         1 0.0045410 0.11739 -266.65
## <none>                         0.11284 -266.54
## - EP_AGE65         1 0.0053206 0.11817 -266.33
## - EP_UNEMP         1 0.0080345 0.12088 -265.24
## + EP_DISABL        1 0.0010079 0.11184 -264.97
## + EP_GROUPQ        1 0.0009429 0.11190 -264.94
## + EP_AGE17         1 0.0006913 0.11215 -264.84
## + EP_CROWD         1 0.0000018 0.11284 -264.54
## - EP_SNGPNT        1 0.0102151 0.12306 -264.38
## - EP_LIMENG        1 0.0102823 0.12313 -264.36
## - EP_MOBILE        1 0.0121526 0.12500 -263.63
## - EP_POV150        1 0.0154489 0.12829 -262.38
## - EP_MUNIT         1 0.0227824 0.13563 -259.71
## - EP_NOHSDP        1 0.0230720 0.13592 -259.61
## 
## Step:  AIC=-267.81
## y_train ~ EP_POV150 + EP_LIMENG + EP_AGE65 + EP_MINRTY + EP_UNEMP + 
##     EP_NOHSDP + EP_SNGPNT + EP_MUNIT + EP_MOBILE + EP_NOVEH
## 
##                   Df Sum of Sq     RSS     AIC
## - EP_MINRTY        1 0.0018962 0.11646 -269.03
## - EP_NOVEH         1 0.0033541 0.11792 -268.43
## - EP_AGE65         1 0.0044758 0.11904 -267.98
## <none>                         0.11457 -267.81
## + confirmed_cases  1 0.0017218 0.11284 -266.54
## + EP_AGE17         1 0.0015160 0.11305 -266.45
## + EP_GROUPQ        1 0.0007971 0.11377 -266.15
## - EP_SNGPNT        1 0.0091792 0.12374 -266.12
## + EP_DISABL        1 0.0006863 0.11388 -266.10
## - EP_UNEMP         1 0.0093913 0.12396 -266.03
## + EP_CROWD         1 0.0002027 0.11436 -265.90
## - EP_MOBILE        1 0.0111971 0.12576 -265.34
## - EP_LIMENG        1 0.0117731 0.12634 -265.12
## - EP_POV150        1 0.0146330 0.12920 -264.05
## - EP_MUNIT         1 0.0214240 0.13599 -261.59
## - EP_NOHSDP        1 0.0225704 0.13714 -261.18
## 
## Step:  AIC=-269.03
## y_train ~ EP_POV150 + EP_LIMENG + EP_AGE65 + EP_UNEMP + EP_NOHSDP + 
##     EP_SNGPNT + EP_MUNIT + EP_MOBILE + EP_NOVEH
## 
##                   Df Sum of Sq     RSS     AIC
## - EP_AGE65         1  0.002585 0.11905 -269.97
## <none>                         0.11646 -269.03
## - EP_NOVEH         1  0.005665 0.12213 -268.75
## + EP_MINRTY        1  0.001896 0.11457 -267.81
## + confirmed_cases  1  0.001792 0.11467 -267.77
## + EP_AGE17         1  0.001676 0.11479 -267.72
## + EP_GROUPQ        1  0.001209 0.11525 -267.53
## + EP_DISABL        1  0.000741 0.11572 -267.33
## - EP_MOBILE        1  0.009960 0.12642 -267.09
## - EP_LIMENG        1  0.009978 0.12644 -267.08
## + EP_CROWD         1  0.000000 0.11646 -267.03
## - EP_SNGPNT        1  0.010319 0.12678 -266.95
## - EP_POV150        1  0.012938 0.12940 -265.97
## - EP_UNEMP         1  0.020313 0.13678 -263.31
## - EP_NOHSDP        1  0.026480 0.14294 -261.19
## - EP_MUNIT         1  0.038744 0.15521 -257.24
## 
## Step:  AIC=-269.97
## y_train ~ EP_POV150 + EP_LIMENG + EP_UNEMP + EP_NOHSDP + EP_SNGPNT + 
##     EP_MUNIT + EP_MOBILE + EP_NOVEH
## 
##                   Df Sum of Sq     RSS     AIC
## <none>                         0.11905 -269.97
## - EP_NOVEH         1  0.005262 0.12431 -269.90
## + EP_AGE65         1  0.002585 0.11646 -269.03
## + EP_DISABL        1  0.002296 0.11675 -268.91
## - EP_SNGPNT        1  0.007954 0.12700 -268.87
## + EP_GROUPQ        1  0.001131 0.11792 -268.43
## + confirmed_cases  1  0.000876 0.11817 -268.33
## + EP_AGE17         1  0.000069 0.11898 -268.00
## + EP_CROWD         1  0.000038 0.11901 -267.99
## + EP_MINRTY        1  0.000005 0.11904 -267.98
## - EP_MOBILE        1  0.010645 0.12969 -267.86
## - EP_POV150        1  0.011651 0.13070 -267.49
## - EP_LIMENG        1  0.012139 0.13119 -267.31
## - EP_NOHSDP        1  0.024100 0.14315 -263.12
## - EP_UNEMP         1  0.034478 0.15352 -259.76
## - EP_MUNIT         1  0.036164 0.15521 -259.24

Model after feature selection

# Display the selected model
summary(lm_model2)
## 
## Call:
## lm(formula = y_train ~ EP_POV150 + EP_LIMENG + EP_UNEMP + EP_NOHSDP + 
##     EP_SNGPNT + EP_MUNIT + EP_MOBILE + EP_NOVEH, data = cbind(y_train, 
##     X_train))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.105153 -0.028649 -0.001766  0.031433  0.102611 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  0.096140   0.043226   2.224  0.03200 * 
## EP_POV150    0.005710   0.002923   1.954  0.05793 . 
## EP_LIMENG   -0.011330   0.005681  -1.994  0.05316 . 
## EP_UNEMP     0.020010   0.005954   3.361  0.00175 **
## EP_NOHSDP    0.010552   0.003755   2.810  0.00771 **
## EP_SNGPNT    0.011662   0.007224   1.614  0.11453   
## EP_MUNIT     0.007165   0.002082   3.442  0.00139 **
## EP_MOBILE    0.004914   0.002631   1.867  0.06937 . 
## EP_NOVEH    -0.003763   0.002866  -1.313  0.19688   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05525 on 39 degrees of freedom
## Multiple R-squared:  0.8495, Adjusted R-squared:  0.8186 
## F-statistic: 27.52 on 8 and 39 DF,  p-value: 9.513e-14
# Get the selected features
selected_features <- names(coef(lm_model2)[-1])
selected_features
## [1] "EP_POV150" "EP_LIMENG" "EP_UNEMP"  "EP_NOHSDP" "EP_SNGPNT" "EP_MUNIT" 
## [7] "EP_MOBILE" "EP_NOVEH"

Observation:

The refined model, lm_model2, retains a robust explanatory capacity with a multiple R-squared of 0.8495, close to the initial model’s 0.8600. This is particularly noteworthy because it was achieved with a reduced set of predictors, indicating that the retained variables are pivotal in explaining the variance in the social vulnerability index.

The Adjusted R-squared value of our refined model saw a marginal increase from the original model, suggesting that the model’s efficiency has been enhanced when adjusting for the number of predictors. This underscores the effectiveness of the feature selection process in isolating the most influential variables while eschewing redundant or less informative ones.

Statistical analysis of lm_model2 reveals that variables such as unemployment rates, lack of high school diploma attainment, and multi-unit housing remain statistically significant, reinforcing their roles as critical indicators of social vulnerability in the context of the pandemic.

A decrease in the residual standard error in lm_model2 compared to the initial model suggests a tighter fit to the data, enhancing the predictive accuracy of the model.

The refined model strikes a good balance between complexity and explanatory power, facilitating a nuanced understanding of the factors contributing to social vulnerability during the COVID-19 pandemic. This balance matters for parsimony: the model avoids unnecessary complexity while still capturing the essential dynamics at play.
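A check not performed above is whether the dropped predictors were jointly informative; a partial F-test via anova(lm_model2, lm_model1) would answer this directly. The following is a self-contained sketch on simulated data (the variables and model objects here are illustrative, not the ones fitted above):

```r
# Sketch: partial F-test comparing a reduced model to the full model.
# A large p-value in the second row suggests the dropped terms add little.
set.seed(42)
d <- data.frame(x1 = rnorm(60), x2 = rnorm(60), x3 = rnorm(60))
d$y <- 2 * d$x1 + rnorm(60)            # only x1 carries signal

reduced <- lm(y ~ x1, data = d)
full    <- lm(y ~ x1 + x2 + x3, data = d)

comparison <- anova(reduced, full)     # in the report: anova(lm_model2, lm_model1)
comparison
```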

Multiple Linear Regression for number of deaths #2

Initial MLR for deaths

library(caret)

# Preparing the independent variable (X) and the dependent variable (y)
y <- svi_all_counties_deaths$RPL_THEMES
X <- svi_all_counties_deaths[, c('deaths', 'EP_POV150', 'EP_LIMENG', 'EP_AGE65',
                                 'EP_MINRTY', 'EP_UNEMP', 'EP_NOHSDP', 'EP_AGE17',
                                 'EP_DISABL', 'EP_SNGPNT', 'EP_MUNIT', 'EP_MOBILE',
                                 'EP_CROWD', 'EP_NOVEH', 'EP_GROUPQ')]


# Splitting the data into train and test data
set.seed(42)  # For reproducibility
train_indices <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- X[train_indices, ]
X_test <- X[-train_indices, ]
y_train <- y[train_indices]
y_test <- y[-train_indices]

# Initiate the model
lm_model3 <- lm(y_train ~ ., data = cbind(y_train, X_train))

# Predictions on the test set
y_predict <- predict(lm_model3, newdata = X_test)
summary(lm_model3)
## 
## Call:
## lm(formula = y_train ~ ., data = cbind(y_train, X_train))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.133460 -0.033317  0.001034  0.037791  0.105024 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -1.124e-01  2.181e-01  -0.515   0.6099  
## deaths      -1.330e-07  3.570e-07  -0.373   0.7119  
## EP_POV150    7.487e-03  3.713e-03   2.016   0.0522 .
## EP_LIMENG   -2.473e-03  8.619e-03  -0.287   0.7760  
## EP_AGE65     5.114e-03  5.406e-03   0.946   0.3512  
## EP_MINRTY    1.191e-03  1.925e-03   0.619   0.5405  
## EP_UNEMP     4.710e-03  9.641e-03   0.489   0.6285  
## EP_NOHSDP    8.146e-03  6.664e-03   1.222   0.2305  
## EP_AGE17     2.942e-03  9.066e-03   0.324   0.7477  
## EP_DISABL    4.634e-03  7.012e-03   0.661   0.5134  
## EP_SNGPNT    1.514e-02  9.044e-03   1.674   0.1038  
## EP_MUNIT     6.876e-03  3.447e-03   1.995   0.0546 .
## EP_MOBILE    5.193e-03  3.351e-03   1.550   0.1310  
## EP_CROWD    -6.547e-03  8.107e-03  -0.808   0.4253  
## EP_NOVEH    -1.696e-03  1.085e-02  -0.156   0.8767  
## EP_GROUPQ   -2.268e-03  4.252e-03  -0.533   0.5974  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06291 on 32 degrees of freedom
## Multiple R-squared:  0.8377, Adjusted R-squared:  0.7616 
## F-statistic: 11.01 on 15 and 32 DF,  p-value: 1.087e-08
# Score (R-squared)
print(summary(lm_model3)$r.squared)
## [1] 0.837654
# Coefficients
deaths_all_counties_coef <- data.frame(Coef = coef(lm_model3)[-1], Features = names(coef(lm_model3)[-1]))
deaths_all_counties_coef <- deaths_all_counties_coef[order(-deaths_all_counties_coef$Coef), ]

The model’s summary shows a multiple R-squared of 0.8377, indicating that approximately 83.8% of the variance in the SVI score is captured by the model, although the individual predictors vary in statistical significance. Notably, EP_POV150 (poverty) has a p-value close to the conventional 0.05 threshold, suggesting a meaningful association with the SVI score. EP_MUNIT (multi-unit housing) also approaches significance, potentially reflecting the impact of housing conditions on social vulnerability.

However, the variable deaths does not exhibit a significant association with the SVI score within the context of this model, suggesting that the number of deaths alone is not a direct predictor of social vulnerability as encapsulated by the RPL_THEMES score when considering the influence of the other variables.

The model yields a residual standard error of 0.06291, and despite a substantial F-statistic, caution is needed when interpreting the non-significant variables. The adjusted R-squared of 0.7616 reflects the model’s explanatory power after penalizing for the number of predictors and still indicates a reasonable fit to the data.
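Many of the near-threshold p-values here may reflect collinearity among the predictors, which the strong inter-correlations in the matrices already hinted at. A variance inflation factor check is one diagnostic; the helper below is a base-R sketch (the car package's vif() is the usual tool) that regresses one predictor on all the others:

```r
# Sketch: variance inflation factor for one column of a predictor frame.
# VIF = 1 / (1 - R^2) from regressing that column on all the others;
# values well above ~5 commonly flag problematic collinearity.
vif_one <- function(data, target) {
  fit <- lm(reformulate(setdiff(names(data), target), response = target), data = data)
  1 / (1 - summary(fit)$r.squared)
}

# Usage with the design matrix built above, e.g.:
# vif_one(X_train, "EP_POV150")
```

High VIFs would help explain why individually plausible predictors such as deaths fail to reach significance once the others are in the model.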

Feature Selection for deaths MLR

# Stepwise feature selection
lm_model4 <- step(lm_model3, direction = "both")
## Start:  AIC=-253
## y_train ~ deaths + EP_POV150 + EP_LIMENG + EP_AGE65 + EP_MINRTY + 
##     EP_UNEMP + EP_NOHSDP + EP_AGE17 + EP_DISABL + EP_SNGPNT + 
##     EP_MUNIT + EP_MOBILE + EP_CROWD + EP_NOVEH + EP_GROUPQ
## 
##             Df Sum of Sq     RSS     AIC
## - EP_NOVEH   1 0.0000968 0.12675 -254.96
## - EP_LIMENG  1 0.0003258 0.12698 -254.88
## - EP_AGE17   1 0.0004167 0.12707 -254.84
## - deaths     1 0.0005495 0.12720 -254.79
## - EP_UNEMP   1 0.0009447 0.12760 -254.64
## - EP_GROUPQ  1 0.0011261 0.12778 -254.57
## - EP_MINRTY  1 0.0015151 0.12817 -254.43
## - EP_DISABL  1 0.0017286 0.12838 -254.35
## - EP_CROWD   1 0.0025812 0.12923 -254.03
## - EP_AGE65   1 0.0035418 0.13020 -253.68
## <none>                   0.12665 -253.00
## - EP_NOHSDP  1 0.0059131 0.13257 -252.81
## - EP_MOBILE  1 0.0095060 0.13616 -251.53
## - EP_SNGPNT  1 0.0110964 0.13775 -250.97
## - EP_MUNIT   1 0.0157516 0.14240 -249.37
## - EP_POV150  1 0.0160879 0.14274 -249.26
## 
## Step:  AIC=-254.96
## y_train ~ deaths + EP_POV150 + EP_LIMENG + EP_AGE65 + EP_MINRTY + 
##     EP_UNEMP + EP_NOHSDP + EP_AGE17 + EP_DISABL + EP_SNGPNT + 
##     EP_MUNIT + EP_MOBILE + EP_CROWD + EP_GROUPQ
## 
##             Df Sum of Sq     RSS     AIC
## - EP_LIMENG  1 0.0004570 0.12721 -256.79
## - EP_AGE17   1 0.0005486 0.12730 -256.76
## - deaths     1 0.0005919 0.12734 -256.74
## - EP_UNEMP   1 0.0008853 0.12763 -256.63
## - EP_GROUPQ  1 0.0011276 0.12788 -256.54
## - EP_MINRTY  1 0.0016409 0.12839 -256.35
## - EP_DISABL  1 0.0017022 0.12845 -256.32
## - EP_CROWD   1 0.0027816 0.12953 -255.92
## <none>                   0.12675 -254.96
## - EP_AGE65   1 0.0057446 0.13249 -254.84
## - EP_NOHSDP  1 0.0064980 0.13325 -254.56
## - EP_MOBILE  1 0.0094523 0.13620 -253.51
## + EP_NOVEH   1 0.0000968 0.12665 -253.00
## - EP_SNGPNT  1 0.0110474 0.13780 -252.95
## - EP_MUNIT   1 0.0164093 0.14316 -251.12
## - EP_POV150  1 0.0165064 0.14326 -251.09
## 
## Step:  AIC=-256.79
## y_train ~ deaths + EP_POV150 + EP_AGE65 + EP_MINRTY + EP_UNEMP + 
##     EP_NOHSDP + EP_AGE17 + EP_DISABL + EP_SNGPNT + EP_MUNIT + 
##     EP_MOBILE + EP_CROWD + EP_GROUPQ
## 
##             Df Sum of Sq     RSS     AIC
## - deaths     1 0.0004491 0.12766 -258.62
## - EP_UNEMP   1 0.0006280 0.12784 -258.56
## - EP_GROUPQ  1 0.0008911 0.12810 -258.46
## - EP_MINRTY  1 0.0012334 0.12844 -258.33
## - EP_AGE17   1 0.0013211 0.12853 -258.30
## - EP_DISABL  1 0.0026955 0.12990 -257.78
## - EP_CROWD   1 0.0030686 0.13028 -257.65
## <none>                   0.12721 -256.79
## - EP_AGE65   1 0.0056972 0.13290 -256.69
## - EP_NOHSDP  1 0.0089397 0.13615 -255.53
## - EP_MOBILE  1 0.0104275 0.13763 -255.01
## + EP_LIMENG  1 0.0004570 0.12675 -254.96
## + EP_NOVEH   1 0.0002279 0.12698 -254.88
## - EP_SNGPNT  1 0.0118908 0.13910 -254.50
## - EP_MUNIT   1 0.0161630 0.14337 -253.05
## - EP_POV150  1 0.0167628 0.14397 -252.85
## 
## Step:  AIC=-258.62
## y_train ~ EP_POV150 + EP_AGE65 + EP_MINRTY + EP_UNEMP + EP_NOHSDP + 
##     EP_AGE17 + EP_DISABL + EP_SNGPNT + EP_MUNIT + EP_MOBILE + 
##     EP_CROWD + EP_GROUPQ
## 
##             Df Sum of Sq     RSS     AIC
## - EP_UNEMP   1 0.0006583 0.12831 -260.38
## - EP_GROUPQ  1 0.0006676 0.12832 -260.37
## - EP_MINRTY  1 0.0013009 0.12896 -260.13
## - EP_DISABL  1 0.0024034 0.13006 -259.73
## - EP_AGE17   1 0.0027035 0.13036 -259.62
## - EP_CROWD   1 0.0034957 0.13115 -259.32
## <none>                   0.12766 -258.62
## - EP_AGE65   1 0.0061289 0.13378 -258.37
## - EP_NOHSDP  1 0.0087543 0.13641 -257.44
## - EP_MOBILE  1 0.0104576 0.13811 -256.84
## + deaths     1 0.0004491 0.12721 -256.79
## + EP_LIMENG  1 0.0003142 0.12734 -256.74
## + EP_NOVEH   1 0.0002525 0.12740 -256.72
## - EP_SNGPNT  1 0.0115922 0.13925 -256.45
## - EP_POV150  1 0.0164841 0.14414 -254.79
## - EP_MUNIT   1 0.0175708 0.14523 -254.43
## 
## Step:  AIC=-260.37
## y_train ~ EP_POV150 + EP_AGE65 + EP_MINRTY + EP_NOHSDP + EP_AGE17 + 
##     EP_DISABL + EP_SNGPNT + EP_MUNIT + EP_MOBILE + EP_CROWD + 
##     EP_GROUPQ
## 
##             Df Sum of Sq     RSS     AIC
## - EP_GROUPQ  1  0.000375 0.12869 -262.24
## - EP_DISABL  1  0.002946 0.13126 -261.29
## - EP_MINRTY  1  0.003094 0.13141 -261.23
## - EP_CROWD   1  0.003099 0.13141 -261.23
## - EP_AGE17   1  0.003510 0.13182 -261.08
## <none>                   0.12831 -260.38
## - EP_NOHSDP  1  0.008406 0.13672 -259.33
## - EP_MOBILE  1  0.009993 0.13831 -258.77
## + EP_UNEMP   1  0.000658 0.12766 -258.62
## + deaths     1  0.000479 0.12784 -258.56
## + EP_NOVEH   1  0.000115 0.12820 -258.42
## + EP_LIMENG  1  0.000109 0.12821 -258.42
## - EP_AGE65   1  0.011418 0.13973 -258.28
## - EP_SNGPNT  1  0.012217 0.14053 -258.01
## - EP_MUNIT   1  0.016944 0.14526 -256.42
## - EP_POV150  1  0.034705 0.16302 -250.88
## 
## Step:  AIC=-262.23
## y_train ~ EP_POV150 + EP_AGE65 + EP_MINRTY + EP_NOHSDP + EP_AGE17 + 
##     EP_DISABL + EP_SNGPNT + EP_MUNIT + EP_MOBILE + EP_CROWD
## 
##             Df Sum of Sq     RSS     AIC
## - EP_DISABL  1  0.002736 0.13142 -263.23
## - EP_CROWD   1  0.002832 0.13152 -263.19
## - EP_MINRTY  1  0.003188 0.13188 -263.06
## - EP_AGE17   1  0.004447 0.13314 -262.60
## <none>                   0.12869 -262.24
## - EP_NOHSDP  1  0.008032 0.13672 -261.33
## - EP_MOBILE  1  0.009635 0.13832 -260.77
## + EP_GROUPQ  1  0.000375 0.12831 -260.38
## + EP_UNEMP   1  0.000366 0.12832 -260.37
## + deaths     1  0.000280 0.12841 -260.34
## + EP_NOVEH   1  0.000112 0.12858 -260.28
## + EP_LIMENG  1  0.000072 0.12862 -260.26
## - EP_SNGPNT  1  0.012046 0.14073 -259.94
## - EP_AGE65   1  0.012750 0.14144 -259.70
## - EP_MUNIT   1  0.016642 0.14533 -258.40
## - EP_POV150  1  0.034669 0.16336 -252.78
## 
## Step:  AIC=-263.22
## y_train ~ EP_POV150 + EP_AGE65 + EP_MINRTY + EP_NOHSDP + EP_AGE17 + 
##     EP_SNGPNT + EP_MUNIT + EP_MOBILE + EP_CROWD
## 
##             Df Sum of Sq     RSS     AIC
## - EP_MINRTY  1  0.004196 0.13562 -263.72
## - EP_AGE17   1  0.004452 0.13588 -263.63
## - EP_CROWD   1  0.004891 0.13632 -263.47
## <none>                   0.13142 -263.23
## - EP_NOHSDP  1  0.008242 0.13967 -262.31
## + EP_DISABL  1  0.002736 0.12869 -262.24
## + EP_UNEMP   1  0.000853 0.13057 -261.54
## + EP_LIMENG  1  0.000655 0.13077 -261.46
## + EP_GROUPQ  1  0.000165 0.13126 -261.29
## + EP_NOVEH   1  0.000115 0.13131 -261.27
## + deaths     1  0.000088 0.13134 -261.26
## - EP_MOBILE  1  0.012159 0.14358 -260.98
## - EP_SNGPNT  1  0.012482 0.14391 -260.87
## - EP_MUNIT   1  0.014936 0.14636 -260.06
## - EP_AGE65   1  0.029925 0.16135 -255.38
## - EP_POV150  1  0.050666 0.18209 -249.57
## 
## Step:  AIC=-263.72
## y_train ~ EP_POV150 + EP_AGE65 + EP_NOHSDP + EP_AGE17 + EP_SNGPNT + 
##     EP_MUNIT + EP_MOBILE + EP_CROWD
## 
##             Df Sum of Sq     RSS     AIC
## - EP_CROWD   1  0.002397 0.13802 -264.88
## <none>                   0.13562 -263.72
## + EP_MINRTY  1  0.004196 0.13142 -263.23
## + EP_DISABL  1  0.003745 0.13188 -263.06
## + EP_UNEMP   1  0.003221 0.13240 -262.87
## - EP_AGE17   1  0.010298 0.14592 -262.20
## - EP_MOBILE  1  0.011287 0.14691 -261.88
## + EP_NOVEH   1  0.000344 0.13528 -261.84
## + EP_GROUPQ  1  0.000206 0.13542 -261.79
## + deaths     1  0.000134 0.13549 -261.76
## + EP_LIMENG  1  0.000042 0.13558 -261.73
## - EP_NOHSDP  1  0.012108 0.14773 -261.61
## - EP_SNGPNT  1  0.015806 0.15143 -260.43
## - EP_AGE65   1  0.027283 0.16290 -256.92
## - EP_MUNIT   1  0.027540 0.16316 -256.84
## - EP_POV150  1  0.046767 0.18239 -251.50
## 
## Step:  AIC=-264.88
## y_train ~ EP_POV150 + EP_AGE65 + EP_NOHSDP + EP_AGE17 + EP_SNGPNT + 
##     EP_MUNIT + EP_MOBILE
## 
##             Df Sum of Sq     RSS     AIC
## <none>                   0.13802 -264.88
## + EP_DISABL  1  0.005031 0.13299 -264.66
## + EP_CROWD   1  0.002397 0.13562 -263.72
## - EP_NOHSDP  1  0.009711 0.14773 -263.61
## + EP_UNEMP   1  0.001837 0.13618 -263.52
## + EP_MINRTY  1  0.001702 0.13632 -263.47
## + deaths     1  0.000344 0.13767 -263.00
## + EP_LIMENG  1  0.000163 0.13786 -262.93
## + EP_NOVEH   1  0.000124 0.13789 -262.92
## + EP_GROUPQ  1  0.000009 0.13801 -262.88
## - EP_AGE17   1  0.012323 0.15034 -262.77
## - EP_SNGPNT  1  0.014690 0.15271 -262.02
## - EP_MOBILE  1  0.016171 0.15419 -261.56
## - EP_MUNIT   1  0.026213 0.16423 -258.53
## - EP_AGE65   1  0.033740 0.17176 -256.38
## - EP_POV150  1  0.044386 0.18240 -253.49

Model after feature selection

# Display the selected model
summary(lm_model4)
## 
## Call:
## lm(formula = y_train ~ EP_POV150 + EP_AGE65 + EP_NOHSDP + EP_AGE17 + 
##     EP_SNGPNT + EP_MUNIT + EP_MOBILE, data = cbind(y_train, X_train))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.126798 -0.027921  0.001379  0.025232  0.109946 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.241102   0.144877  -1.664 0.103892    
## EP_POV150    0.008180   0.002281   3.587 0.000902 ***
## EP_AGE65     0.008853   0.002831   3.127 0.003286 ** 
## EP_NOHSDP    0.004919   0.002932   1.678 0.101216    
## EP_AGE17     0.010534   0.005574   1.890 0.066049 .  
## EP_SNGPNT    0.016504   0.007998   2.063 0.045608 *  
## EP_MUNIT     0.005614   0.002037   2.756 0.008763 ** 
## EP_MOBILE    0.006032   0.002786   2.165 0.036422 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05874 on 40 degrees of freedom
## Multiple R-squared:  0.8231, Adjusted R-squared:  0.7921 
## F-statistic: 26.59 on 7 and 40 DF,  p-value: 3.772e-13
# Get the selected features
selected_features <- names(coef(lm_model4)[-1])
selected_features
## [1] "EP_POV150" "EP_AGE65"  "EP_NOHSDP" "EP_AGE17"  "EP_SNGPNT" "EP_MUNIT" 
## [7] "EP_MOBILE"

The lm_model4 model after feature selection includes the predictors “EP_POV150,” “EP_AGE65,” “EP_NOHSDP,” “EP_AGE17,” “EP_SNGPNT,” “EP_MUNIT,” and “EP_MOBILE,” which have been identified as the most influential on the Social Vulnerability Index (SVI) score in the context of COVID-19 related deaths.

The lm_model4 (model after feature selection) displays a Multiple R-squared value of 0.8231. This suggests that the model explains over 82% of the variance in the SVI score, a slight decrement from the 0.8377 observed in the full model lm_model3. However, the Adjusted R-squared value has improved to 0.7921, reinforcing the model’s robustness when adjusted for the number of predictors included.

Each variable within the lm_model4 has been evaluated for its statistical significance. Variables such as “EP_POV150” and “EP_AGE65” have shown strong significance, implying a robust relationship with the SVI score.

It is noteworthy that the residual standard error of the lm_model4 remains comparable to that of the full model, reinforcing the accuracy of this more focused model. The F-statistic’s substantial value corroborates the overall statistical validity of the lm_model4 model.

In summary, the model after feature selection balances complexity with interpretability, retaining most of the explanatory power of the full model with a reduced set of variables.
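One way to validate the reduced model is to compare out-of-sample error against the full model. The sketch below assumes `X_test` and `y_test` exist as the held-out complement of the `X_train`/`y_train` split used above (they are not shown in this section):

```r
# Hypothetical hold-out evaluation: X_test / y_test are assumed to be
# the test split corresponding to X_train / y_train.
pred_full    <- predict(lm_model3, newdata = X_test)
pred_reduced <- predict(lm_model4, newdata = X_test)

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse(y_test, pred_full)
rmse(y_test, pred_reduced)

# AIC comparison on the training fits (lower is better)
AIC(lm_model3, lm_model4)
```

If the reduced model's test RMSE is comparable to the full model's, the simpler specification is preferable.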

SVI, Economic and COVID-19

To understand the determinants of the Social Vulnerability Index (SVI) more fully, we merged three datasets (the SVI dataset, the economic dataset, and the COVID-19 dataset), linking them through the shared 'county_fips' and 'GEO_ID' keys. The combined data brings together socio-economic indicators, median household income, and county-level COVID-19 metrics.

Variable Composition:

SVI Dataset:

The SVI dataset contributes a broad set of socio-economic indicators (the SVI variables) covering vulnerability dimensions such as education, income, housing, and health. These variables serve as the core predictors in our models.

Economy Dataset:

The Economy dataset supplies median household income (ACS table B19013). Income intersects with the socio-economic fabric and influences vulnerability levels across communities.

COVID Dataset:

The COVID dataset adds the health-related variables 'total_cases' and 'total_deaths.' Including these metrics acknowledges the disruptions posed by the pandemic and gives the modeling a contemporary lens.

Our objective is to model and predict the overall SVI (RPL_THEMES) from the SVI variables, median income, and COVID-19 metrics. By integrating these dimensions, we aim to uncover the relationships that contribute to social vulnerability, accounting for both pre-existing disparities and the impact of the pandemic.

This predictive modeling has implications for strategic interventions, policy formulation, and community resilience efforts. By clarifying how socio-economic and health factors interact, the models aim to provide actionable insights that can inform targeted interventions and support the development of resilient communities.

In the subsequent sections, we delve into the application of Random Forest and Decision Tree models to further dissect these relationships, offering interpretability and predictive accuracy to our intricate dataset.

Data Cleaning

Before merging the three datasets, we first clean each dataset and rename columns where needed.

SVI_Data <- read.csv("SVI_2020_US.csv")
head(SVI_Data)
##   ST   STATE ST_ABBR STCNTY  COUNTY       FIPS
## 1  1 Alabama      AL   1001 Autauga 1001020100
## 2  1 Alabama      AL   1001 Autauga 1001020200
## 3  1 Alabama      AL   1001 Autauga 1001020300
## 4  1 Alabama      AL   1001 Autauga 1001020400
## 5  1 Alabama      AL   1001 Autauga 1001020501
## 6  1 Alabama      AL   1001 Autauga 1001020502
##                                       LOCATION AREA_SQMI E_TOTPOP M_TOTPOP E_HU
## 1    Census Tract 201, Autauga County, Alabama 3.7935695     1941      390  710
## 2    Census Tract 202, Autauga County, Alabama 1.2821745     1757      310  720
## 3    Census Tract 203, Autauga County, Alabama 2.0653642     3694      570 1464
## 4    Census Tract 204, Autauga County, Alabama 2.4649840     3539      500 1741
## 5 Census Tract 205.01, Autauga County, Alabama 2.3952432     4306      662 1786
## 6 Census Tract 205.02, Autauga County, Alabama 0.8098065     3100      531 1285
##   M_HU E_HH M_HH E_POV150 M_POV150 E_UNEMP M_UNEMP E_HBURD M_HBURD E_NOHSDP
## 1  120  693  121      352      138      18      18     144      59      187
## 2   99  573   99      384      182      29      26     149      60      139
## 3  180 1351  173      842      317      53      45     264      96      317
## 4  200 1636  213      503      213      39      34     284     100      173
## 5  220 1706  244      903      496      23      31     351     176      251
## 6  212 1285  212      404      305      31      50     236     146       40
##   M_NOHSDP E_UNINSUR M_UNINSUR E_AGE65 M_AGE65 E_AGE17 M_AGE17 E_DISABL
## 1       93       187        91     295     101     415     208      413
## 2       59        91        55     284      97     325     110      168
## 3      122       127        85     464      68     929     253      515
## 4      108       169       126     969     270     510     155      690
## 5      171       135       138     541     180    1136     320      594
## 6       60         0        12     401     178    1037     313      280
##   M_DISABL E_SNGPNT M_SNGPNT E_LIMENG M_LIMENG E_MINRTY M_MINRTY E_MUNIT
## 1      147       51       31        0       48      437      192       0
## 2       73       21       25        0       48     1116      306       3
## 3      148      152       88      128      133     1331      548      26
## 4      255       85       65       89      130      454      266     143
## 5      204       88       66       52       84      789      358      57
## 6      146      155      129        0       48      943      406     220
##   M_MUNIT E_MOBILE M_MOBILE E_CROWD M_CROWD E_NOVEH M_NOVEH E_GROUPQ M_GROUPQ
## 1      16       88       43       0      16      10      12        0       12
## 2      13        5        8       9      16      57      37      212       85
## 3      43       14       18      35      37      42      37        0       12
## 4     115        0       12      10      16      72      93        0       12
## 5      87       29       48       0      16      44      50       11        7
## 6     142        0       12      37      64       0      12        0       12
##   EP_POV150 MP_POV150 EP_UNEMP MP_UNEMP EP_HBURD MP_HBURD EP_NOHSDP MP_NOHSDP
## 1      18.1       6.1      2.1      2.1     20.8      7.7      14.3       6.3
## 2      25.4      11.0      4.0      3.5     26.0      9.5      10.6       4.8
## 3      22.8       7.8      2.7      2.3     19.5      6.7      12.8       5.3
## 4      14.2       5.7      2.4      2.0     17.4      5.7       6.2       3.6
## 5      21.0      11.1      1.0      1.3     20.6      9.9       8.8       6.0
## 6      13.0       9.6      2.7      4.2     18.4     10.9       2.5       3.5
##   EP_UNINSUR MP_UNINSUR EP_AGE65 MP_AGE65 EP_AGE17 MP_AGE17 EP_DISABL MP_DISABL
## 1        9.6        5.1     15.2      5.1     21.4      9.8      21.3       7.4
## 2        5.9        3.7     16.2      4.6     18.5      5.3      11.0       3.8
## 3        3.5        2.4     12.6      2.5     25.1      5.6      14.0       3.5
## 4        4.8        3.3     27.4      6.5     14.4      3.9      19.6       6.2
## 5        3.2        3.3     12.6      4.6     26.4      6.2      14.0       4.9
## 6        0.0        1.2     12.9      5.5     33.5      8.3      10.0       5.5
##   EP_SNGPNT MP_SNGPNT EP_LIMENG MP_LIMENG EP_MINRTY MP_MINRTY EP_MUNIT MP_MUNIT
## 1       7.4       4.3       0.0       2.6      22.5       8.8      0.0      2.3
## 2       3.7       4.3       0.0       2.9      63.5      13.3      0.4      1.8
## 3      11.3       6.4       3.6       3.7      36.0      13.8      1.8      2.9
## 4       5.2       3.9       2.6       3.8      12.8       7.3      8.2      6.5
## 5       5.2       3.8       1.3       2.1      18.3       7.8      3.2      4.9
## 6      12.1       9.8       0.0       1.7      30.4      12.0     17.1     10.7
##   EP_MOBILE MP_MOBILE EP_CROWD MP_CROWD EP_NOVEH MP_NOVEH EP_GROUPQ MP_GROUPQ
## 1      12.4       5.8      0.0      2.3      1.4      1.8       0.0       0.6
## 2       0.7       1.1      1.6      2.8      9.9      6.3      12.1       4.3
## 3       1.0       1.2      2.6      2.7      3.1      2.8       0.0       0.3
## 4       0.0       2.0      0.6      1.0      4.4      5.4       0.0       0.3
## 5       1.6       2.7      0.0      0.9      2.6      3.0       0.3       0.2
## 6       0.0       2.7      2.9      5.0      0.0      2.7       0.0       0.4
##   EPL_POV150 EPL_UNEMP EPL_HBURD EPL_NOHSDP EPL_UNINSUR SPL_THEME1 RPL_THEME1
## 1     0.4727    0.1731    0.3448     0.6963      0.6529     2.3398     0.4578
## 2     0.6491    0.4210    0.5214     0.5673      0.4320     2.5908     0.5348
## 3     0.5917    0.2487    0.2965     0.6504      0.2399     2.0272     0.3639
## 4     0.3598    0.2096    0.2214     0.3440      0.3486     1.4834     0.2081
## 5     0.5495    0.0641    0.3375     0.4860      0.2142     1.6513     0.2540
## 6     0.3250    0.2487    0.2575     0.1165      0.0000     0.9477     0.0828
##   EPL_AGE65 EPL_AGE17 EPL_DISABL EPL_SNGPNT EPL_LIMENG SPL_THEME2 RPL_THEME2
## 1    0.4693    0.4653     0.8926     0.6627     0.0000     2.4899     0.5079
## 2    0.5252    0.2898     0.3984     0.3493     0.0000     1.5627     0.0810
## 3    0.3246    0.6915     0.5985     0.8384     0.7095     3.1625     0.8632
## 4    0.9146    0.1303     0.8487     0.4947     0.6462     3.0345     0.8131
## 5    0.3246    0.7589     0.5985     0.4947     0.5112     2.6879     0.6297
## 6    0.3415    0.9511     0.3274     0.8614     0.0000     2.4814     0.5020
##   EPL_MINRTY SPL_THEME3 RPL_THEME3 EPL_MUNIT EPL_MOBILE EPL_CROWD EPL_NOVEH
## 1     0.3921     0.3921     0.3921    0.0000     0.8180    0.0000    0.1872
## 2     0.7610     0.7610     0.7610    0.2798     0.5115    0.4821    0.7387
## 3     0.5529     0.5529     0.5529    0.3930     0.5418    0.6078    0.3651
## 4     0.2386     0.2386     0.2386    0.6117     0.0000    0.3026    0.4731
## 5     0.3313     0.3313     0.3313    0.4630     0.5860    0.0000    0.3173
## 6     0.4923     0.4923     0.4923    0.7559     0.0000    0.6378    0.0000
##   EPL_GROUPQ SPL_THEME4 RPL_THEME4 SPL_THEMES RPL_THEMES F_POV150 F_UNEMP
## 1     0.0000     1.0052     0.0945     6.2270     0.2823        0       0
## 2     0.9570     2.9691     0.7915     7.8836     0.5406        0       0
## 3     0.0000     1.9077     0.3500     7.6503     0.5042        0       0
## 4     0.0000     1.3874     0.1759     6.1439     0.2703        0       0
## 5     0.5375     1.9038     0.3484     6.5743     0.3343        0       0
## 6     0.0000     1.3937     0.1776     5.3151     0.1629        0       0
##   F_HBURD F_NOHSDP F_UNINSUR F_THEME1 F_AGE65 F_AGE17 F_DISABL F_SNGPNT
## 1       0        0         0        0       0       0        0        0
## 2       0        0         0        0       0       0        0        0
## 3       0        0         0        0       0       0        0        0
## 4       0        0         0        0       1       0        0        0
## 5       0        0         0        0       0       0        0        0
## 6       0        0         0        0       0       1        0        0
##   F_LIMENG F_THEME2 F_MINRTY F_THEME3 F_MUNIT F_MOBILE F_CROWD F_NOVEH F_GROUPQ
## 1        0        0        0        0       0        0       0       0        0
## 2        0        0        0        0       0        0       0       0        1
## 3        0        0        0        0       0        0       0       0        0
## 4        0        1        0        0       0        0       0       0        0
## 5        0        0        0        0       0        0       0       0        0
## 6        0        1        0        0       0        0       0       0        0
##   F_THEME4 F_TOTAL E_DAYPOP E_NOINT M_NOINT E_AFAM M_AFAM E_HISP M_HISP E_ASIAN
## 1        0       0     1033     217     544    235    151     33     34      41
## 2        1       1     4080     301     372   1026    298     30     38       0
## 3        0       0     2056     401     795   1042    499    180    207      44
## 4        0       1     1908     335     712    309    242     17     27      17
## 5        0       0     3774     172     954    658    341     36     64      80
## 6        0       1     3437     376     808    232     94    369    217     278
##   M_ASIAN E_AIAN M_AIAN E_NHPI M_NHPI E_TWOMORE M_TWOMORE E_OTHERRACE
## 1      54      0     12      0     12       128        99           0
## 2      12      0     12      0     12        46        54          14
## 3      42      0     12      0     12        65        85           0
## 4      21     10     17      0     12       101       103           0
## 5      85      0     12      0     12        15        24           0
## 6     212      0     12      0     12        32       221          32
##   M_OTHERRACE EP_NOINT MP_NOINT EP_AFAM MP_AFAM EP_HISP MP_HISP EP_ASIAN
## 1          12     11.2      2.3    12.1     7.1     1.7     1.8      2.1
## 2          19     19.5      3.8    58.4     8.0     1.7     2.1      0.0
## 3          12     10.9      1.7    28.2    10.4     4.9     5.5      1.2
## 4          12      9.5      1.4     8.7     6.5     0.5     0.7      0.5
## 5          12      4.0      0.7    15.3     7.6     0.8     1.5      1.9
## 6         124     12.1      2.1     7.5     3.1    11.9     7.1      9.0
##   MP_ASIAN EP_AIAN MP_AIAN EP_NHPI MP_NHPI EP_TWOMORE MP_TWOMORE EP_OTHERRACE
## 1      2.7     0.0     1.8       0     1.8        6.6        5.1          0.0
## 2      2.0     0.0     2.0       0     2.0        2.6        3.0          0.8
## 3      1.1     0.0     0.9       0     0.9        1.8        2.3          0.0
## 4      0.6     0.3     0.5       0     1.0        2.9        2.8          0.0
## 5      2.0     0.0     0.8       0     0.8        0.3        0.6          0.0
## 6      6.6     0.0     1.1       0     1.1        1.0        7.3          1.0
##   MP_OTHERRACE
## 1          1.8
## 2          1.1
## 3          0.9
## 4          1.0
## 5          0.8
## 6          4.0
econ <- read.csv("ACSDT5Y2020.B19013-Data.csv")
head(econ)
##                 GEO_ID                                          NAME
## 1            Geography                          Geographic Area Name
## 2          0400000US06                                    California
## 3 1400000US06001400100 Census Tract 4001, Alameda County, California
## 4 1400000US06001400200 Census Tract 4002, Alameda County, California
## 5 1400000US06001400300 Census Tract 4003, Alameda County, California
## 6 1400000US06001400400 Census Tract 4004, Alameda County, California
##                                                                                    B19013_001E
## 1 Estimate!!Median household income in the past 12 months (in 2020 inflation-adjusted dollars)
## 2                                                                                        78672
## 3                                                                                       220921
## 4                                                                                       200192
## 5                                                                                       118695
## 6                                                                                       137067
##                                                                                           B19013_001M
## 1 Margin of Error!!Median household income in the past 12 months (in 2020 inflation-adjusted dollars)
## 2                                                                                                 270
## 3                                                                                               25969
## 4                                                                                               23361
## 5                                                                                               23530
## 6                                                                                                5162
##    X
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
us_cases <- read.csv('USAFacts/confirmed-covid-19-cases-in-us-by-state-and-county.csv')
head(us_cases)
##   X county_fips           county_name state_name state_fips       date
## 1 0           0 Statewide Unallocated         AL          1 2020-01-22
## 2 1           0 Statewide Unallocated         AL          1 2020-01-23
## 3 2           0 Statewide Unallocated         AL          1 2020-01-24
## 4 3           0 Statewide Unallocated         AL          1 2020-01-25
## 5 4           0 Statewide Unallocated         AL          1 2020-01-26
## 6 5           0 Statewide Unallocated         AL          1 2020-01-27
##   confirmed lat long geometry
## 1         0  NA   NA         
## 2         0  NA   NA         
## 3         0  NA   NA         
## 4         0  NA   NA         
## 5         0  NA   NA         
## 6         0  NA   NA
us_deaths <- read.csv('USAFacts/confirmed-covid-19-deaths-in-us-by-state-and-county.csv')
tail(us_deaths)
##             X county_fips   county_name state_name state_fips       date deaths
## 600655 600654       56045 Weston County         WY         56 2020-07-22      0
## 600656 600655       56045 Weston County         WY         56 2020-07-23      0
## 600657 600656       56045 Weston County         WY         56 2020-07-24      0
## 600658 600657       56045 Weston County         WY         56 2020-07-25      0
## 600659 600658       56045 Weston County         WY         56 2020-07-26      0
## 600660 600659       56045 Weston County         WY         56 2020-07-27      0
##             lat      long                         geometry
## 600655 43.83961 -104.5675 POINT (-104.5674881 43.83961191)
## 600656 43.83961 -104.5675 POINT (-104.5674881 43.83961191)
## 600657 43.83961 -104.5675 POINT (-104.5674881 43.83961191)
## 600658 43.83961 -104.5675 POINT (-104.5674881 43.83961191)
## 600659 43.83961 -104.5675 POINT (-104.5674881 43.83961191)
## 600660 43.83961 -104.5675 POINT (-104.5674881 43.83961191)

Subsetting and renaming columns

Clean_data <- subset(SVI_Data, ST_ABBR == "CA")
CA_SVI <- subset(Clean_data, select = c(STATE,STCNTY,COUNTY,FIPS,EP_POV150,EP_LIMENG,EP_AGE65, EP_MINRTY,   EP_UNEMP,   EP_NOHSDP,EP_AGE17,EP_DISABL,EP_SNGPNT,EP_MUNIT,EP_MOBILE,EP_CROWD,EP_NOVEH,EP_GROUPQ,RPL_THEMES) )

# Renaming columns (rename() comes from dplyr)
library(dplyr)
colnames(CA_SVI)[colnames(CA_SVI) == 'STCNTY'] <- 'county_fips'
us_cases <- rename(us_cases, confirmed_cases = confirmed)


# Filter out rows with any value equal to -999
CA_SVI <- CA_SVI[rowSums(CA_SVI == -999, na.rm = TRUE) == 0, ]

# Drop rows with missing values
CA_SVI <- na.omit(CA_SVI)
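As a quick sanity check (a minimal sketch), we can confirm that the -999 sentinel values and missing rows are gone after the two filtering steps:

```r
# Both conditions should hold after filtering; stopifnot() errors otherwise
stopifnot(!any(CA_SVI == -999, na.rm = TRUE))
stopifnot(!anyNA(CA_SVI))
```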


covid19_cases<-us_cases
covid19_deaths<-us_deaths

covid19_cases<-subset(covid19_cases, state_name=='CA')
covid19_deaths<-subset(covid19_deaths, state_name=='CA')

head(covid19_cases)
##           X county_fips           county_name state_name state_fips       date
## 35909 35908           0 Statewide Unallocated         CA          6 2020-01-22
## 35910 35909           0 Statewide Unallocated         CA          6 2020-01-23
## 35911 35910           0 Statewide Unallocated         CA          6 2020-01-24
## 35912 35911           0 Statewide Unallocated         CA          6 2020-01-25
## 35913 35912           0 Statewide Unallocated         CA          6 2020-01-26
## 35914 35913           0 Statewide Unallocated         CA          6 2020-01-27
##       confirmed_cases lat long geometry
## 35909               0  NA   NA         
## 35910               0  NA   NA         
## 35911               0  NA   NA         
## 35912               0  NA   NA         
## 35913               0  NA   NA         
## 35914               0  NA   NA
head(covid19_deaths)
##           X county_fips           county_name state_name state_fips       date
## 35909 35908           0 Statewide Unallocated         CA          6 2020-01-22
## 35910 35909           0 Statewide Unallocated         CA          6 2020-01-23
## 35911 35910           0 Statewide Unallocated         CA          6 2020-01-24
## 35912 35911           0 Statewide Unallocated         CA          6 2020-01-25
## 35913 35912           0 Statewide Unallocated         CA          6 2020-01-26
## 35914 35913           0 Statewide Unallocated         CA          6 2020-01-27
##       deaths lat long geometry
## 35909      0  NA   NA         
## 35910      0  NA   NA         
## 35911      0  NA   NA         
## 35912      0  NA   NA         
## 35913      0  NA   NA         
## 35914      0  NA   NA

Merging COVID-19 cases and deaths

merged_data <- merge(covid19_cases, covid19_deaths, by = "county_fips")

# Load the dplyr package
library(dplyr)

# Calculate total cases and deaths for each county
covid19 <- merged_data %>%
  group_by(county_fips) %>%
  mutate(total_cases = sum(confirmed_cases),
         total_deaths = sum(deaths)) %>%
  ungroup()

head(covid19)
## # A tibble: 6 × 21
##   county_fips   X.x county_name.x         state_name.x state_fips.x date.x    
##         <int> <int> <chr>                 <chr>               <int> <chr>     
## 1           0 35908 Statewide Unallocated CA                      6 2020-01-22
## 2           0 35908 Statewide Unallocated CA                      6 2020-01-22
## 3           0 35908 Statewide Unallocated CA                      6 2020-01-22
## 4           0 35908 Statewide Unallocated CA                      6 2020-01-22
## 5           0 35908 Statewide Unallocated CA                      6 2020-01-22
## 6           0 35908 Statewide Unallocated CA                      6 2020-01-22
## # ℹ 15 more variables: confirmed_cases <int>, lat.x <dbl>, long.x <dbl>,
## #   geometry.x <chr>, X.y <int>, county_name.y <chr>, state_name.y <chr>,
## #   state_fips.y <int>, date.y <chr>, deaths <int>, lat.y <dbl>, long.y <dbl>,
## #   geometry.y <chr>, total_cases <int>, total_deaths <int>


Creating a new dataframe with total cases and deaths for each county

# Create a new dataframe with total cases and deaths for each county
summary_data <- covid19 %>%
  group_by(county_fips) %>%
  summarize(total_cases = sum(confirmed_cases),
            total_deaths = sum(deaths))

# Display the new dataframe
print(summary_data)
## # A tibble: 60 × 3
##    county_fips total_cases total_deaths
##          <int>       <int>        <int>
##  1           0        6016            0
##  2        6000      564564            0
##  3        6001    87753324      2084168
##  4        6003       43428            0
##  5        6005      411344            0
##  6        6007     3079064        27260
##  7        6009      631680         1316
##  8        6011      876644         2444
##  9        6013    47864988      1044716
## 10        6015      706128            0
## # ℹ 50 more rows
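Note that because `merge()` keys only on `county_fips`, every case row for a county is paired with every death row for that county, so summing the cumulative daily counts multiplies them up, which likely explains totals such as 87,753,324 cases for Alameda County. If `confirmed_cases` and `deaths` are cumulative series, a more direct per-county aggregation (a sketch of an alternative, not the approach used above) would take the final value of each series from the unmerged tables:

```r
library(dplyr)

# For cumulative series, the last (maximum) value per county is the total.
cases_totals <- covid19_cases %>%
  group_by(county_fips) %>%
  summarize(total_cases = max(confirmed_cases))

deaths_totals <- covid19_deaths %>%
  group_by(county_fips) %>%
  summarize(total_deaths = max(deaths))

county_totals <- inner_join(cases_totals, deaths_totals, by = "county_fips")
```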

Merging SVI data and COVID-19 data using county_fips

#merging svi and summary_data
svi_covid <- merge(CA_SVI, summary_data, by = "county_fips", all.x = TRUE)
head(svi_covid)
##   county_fips      STATE  COUNTY       FIPS EP_POV150 EP_LIMENG EP_AGE65
## 1        6001 California Alameda 6001400100       6.8       0.0     25.3
## 2        6001 California Alameda 6001400200       7.0       1.3     21.7
## 3        6001 California Alameda 6001400300       8.6       0.3     16.4
## 4        6001 California Alameda 6001400400      12.0       1.3     10.8
## 5        6001 California Alameda 6001400500      12.8       1.1     15.1
## 6        6001 California Alameda 6001400600      10.7       0.9     10.5
##   EP_MINRTY EP_UNEMP EP_NOHSDP EP_AGE17 EP_DISABL EP_SNGPNT EP_MUNIT EP_MOBILE
## 1      27.3      1.0       2.7     18.6      11.8       6.2      0.0       0.0
## 2      32.8      7.9       1.1     16.3       8.9       2.7     12.9       0.0
## 3      39.0      3.4       4.5     17.0      11.4       6.3     24.2       0.4
## 4      35.5      2.5       4.3     21.1       8.3       5.6      7.8       0.0
## 5      55.3      6.4       2.6     11.5       6.1       2.0     12.1       0.0
## 6      56.1      5.6       2.4     15.3       7.9       5.8      5.5       0.0
##   EP_CROWD EP_NOVEH EP_GROUPQ RPL_THEMES total_cases total_deaths
## 1      0.0      4.4       0.0     0.0274    87753324      2084168
## 2      0.7     10.4       4.9     0.2744    87753324      2084168
## 3      2.1     15.0       0.8     0.4551    87753324      2084168
## 4      0.8      9.5       1.1     0.2426    87753324      2084168
## 5      2.6      7.8       0.0     0.2332    87753324      2084168
## 6      0.0      7.5       0.5     0.1700    87753324      2084168

Cleaning economic data

#clean data for economic
econ <- subset(econ, select = c(GEO_ID, NAME,B19013_001E))
econ <- econ[-c(1, 2), ] 


# Rename columns: GEO_ID unchanged, NAME -> tract, B19013_001E -> income
econ <- setNames(econ, c("GEO_ID", "tract", "income"))


econ$GEO_ID <- sub(".*US0*(\\d+)", "\\1", econ$GEO_ID)
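The `sub()` pattern strips everything up to `US`, plus any leading zeros, leaving the numeric FIPS code that matches the SVI `FIPS` column. For example:

```r
sub(".*US0*(\\d+)", "\\1", "1400000US06001400100")
## [1] "6001400100"
```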

svi_econ <- merge(econ, CA_SVI, by.x = "GEO_ID", by.y = "FIPS", all.x = TRUE, all.y = TRUE)


total_na_count <- sum(is.na(svi_econ))
print(total_na_count)
## [1] 1530
#remove NA

svi_econ <- na.omit(svi_econ)
svi_econ <- svi_econ[svi_econ$income != "-", , drop = FALSE]


#leaves us with 9001 rows 
str(svi_econ$income)
##  chr [1:9001] "220921" "200192" "118695" "137067" "110052" "135682" "94700" ...
svi_econ$income <- as.numeric(as.character(svi_econ$income))
total_na_count <- sum(is.na(svi_econ))
print(total_na_count)
## [1] 45
svi_econ <- na.omit(svi_econ)
# 8956 rows remain
merged_data1 <- svi_econ %>%
  left_join(summary_data, by = c("county_fips" = "county_fips"))



head(merged_data1)
##       GEO_ID                                         tract income      STATE
## 1 6001400100 Census Tract 4001, Alameda County, California 220921 California
## 2 6001400200 Census Tract 4002, Alameda County, California 200192 California
## 3 6001400300 Census Tract 4003, Alameda County, California 118695 California
## 4 6001400400 Census Tract 4004, Alameda County, California 137067 California
## 5 6001400500 Census Tract 4005, Alameda County, California 110052 California
## 6 6001400600 Census Tract 4006, Alameda County, California 135682 California
##   county_fips  COUNTY EP_POV150 EP_LIMENG EP_AGE65 EP_MINRTY EP_UNEMP EP_NOHSDP
## 1        6001 Alameda       6.8       0.0     25.3      27.3      1.0       2.7
## 2        6001 Alameda       7.0       1.3     21.7      32.8      7.9       1.1
## 3        6001 Alameda       8.6       0.3     16.4      39.0      3.4       4.5
## 4        6001 Alameda      12.0       1.3     10.8      35.5      2.5       4.3
## 5        6001 Alameda      12.8       1.1     15.1      55.3      6.4       2.6
## 6        6001 Alameda      10.7       0.9     10.5      56.1      5.6       2.4
##   EP_AGE17 EP_DISABL EP_SNGPNT EP_MUNIT EP_MOBILE EP_CROWD EP_NOVEH EP_GROUPQ
## 1     18.6      11.8       6.2      0.0       0.0      0.0      4.4       0.0
## 2     16.3       8.9       2.7     12.9       0.0      0.7     10.4       4.9
## 3     17.0      11.4       6.3     24.2       0.4      2.1     15.0       0.8
## 4     21.1       8.3       5.6      7.8       0.0      0.8      9.5       1.1
## 5     11.5       6.1       2.0     12.1       0.0      2.6      7.8       0.0
## 6     15.3       7.9       5.8      5.5       0.0      0.0      7.5       0.5
##   RPL_THEMES total_cases total_deaths
## 1     0.0274    87753324      2084168
## 2     0.2744    87753324      2084168
## 3     0.4551    87753324      2084168
## 4     0.2426    87753324      2084168
## 5     0.2332    87753324      2084168
## 6     0.1700    87753324      2084168
svi_econ_covid <- merged_data1

Creating a new column with income ranges

# Create a new column 'income_r' based on specified income thresholds
svi_econ_covid$income_r <- cut(svi_econ_covid$income, 
                               breaks = c(-Inf, 50000, 100000, 200000, Inf),
                               labels = c("income<50000", "50000<income<100000", "100000<income<200000", "income>200000"),
                               include.lowest = TRUE)

# View the updated dataset
head(svi_econ_covid)
##       GEO_ID                                         tract income      STATE
## 1 6001400100 Census Tract 4001, Alameda County, California 220921 California
## 2 6001400200 Census Tract 4002, Alameda County, California 200192 California
## 3 6001400300 Census Tract 4003, Alameda County, California 118695 California
## 4 6001400400 Census Tract 4004, Alameda County, California 137067 California
## 5 6001400500 Census Tract 4005, Alameda County, California 110052 California
## 6 6001400600 Census Tract 4006, Alameda County, California 135682 California
##   county_fips  COUNTY EP_POV150 EP_LIMENG EP_AGE65 EP_MINRTY EP_UNEMP EP_NOHSDP
## 1        6001 Alameda       6.8       0.0     25.3      27.3      1.0       2.7
## 2        6001 Alameda       7.0       1.3     21.7      32.8      7.9       1.1
## 3        6001 Alameda       8.6       0.3     16.4      39.0      3.4       4.5
## 4        6001 Alameda      12.0       1.3     10.8      35.5      2.5       4.3
## 5        6001 Alameda      12.8       1.1     15.1      55.3      6.4       2.6
## 6        6001 Alameda      10.7       0.9     10.5      56.1      5.6       2.4
##   EP_AGE17 EP_DISABL EP_SNGPNT EP_MUNIT EP_MOBILE EP_CROWD EP_NOVEH EP_GROUPQ
## 1     18.6      11.8       6.2      0.0       0.0      0.0      4.4       0.0
## 2     16.3       8.9       2.7     12.9       0.0      0.7     10.4       4.9
## 3     17.0      11.4       6.3     24.2       0.4      2.1     15.0       0.8
## 4     21.1       8.3       5.6      7.8       0.0      0.8      9.5       1.1
## 5     11.5       6.1       2.0     12.1       0.0      2.6      7.8       0.0
## 6     15.3       7.9       5.8      5.5       0.0      0.0      7.5       0.5
##   RPL_THEMES total_cases total_deaths             income_r
## 1     0.0274    87753324      2084168        income>200000
## 2     0.2744    87753324      2084168        income>200000
## 3     0.4551    87753324      2084168 100000<income<200000
## 4     0.2426    87753324      2084168 100000<income<200000
## 5     0.2332    87753324      2084168 100000<income<200000
## 6     0.1700    87753324      2084168 100000<income<200000
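One detail of `cut()` worth noting: intervals are right-closed by default, so a tract whose income sits exactly on a threshold falls into the lower bin (e.g. an income of exactly 50000 gets the "income<50000" label). A minimal check with toy values:

```r
# Boundary behavior of the income bins (toy values)
cut(c(49999, 50000, 50001),
    breaks = c(-Inf, 50000, 100000, 200000, Inf),
    labels = c("income<50000", "50000<income<100000",
               "100000<income<200000", "income>200000"))
# 50000 lands in "income<50000" because intervals are (lo, hi]
```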

Decision tree using the SVI, economic, and COVID-19 datasets

#Decision tree using income, covid and other variables

library(rpart)
library(rpart.plot)
library(caret)  # provides createDataPartition() used below

# Set a seed for reproducibility
set.seed(123)

# Split the data into a training set (80%) and a test set (20%)
splitIndex <- createDataPartition(svi_econ_covid$RPL_THEMES, p = 0.8, list = FALSE)
train_data <- svi_econ_covid[splitIndex, ]
test_data <- svi_econ_covid[-splitIndex, ]



#decision tree model
tree_model <- rpart(RPL_THEMES ~ total_cases + total_deaths + income_r + EP_AGE65 + EP_UNEMP+ EP_NOHSDP +EP_POV150, data = train_data)

# Make predictions on the test set
tree_predictions <- predict(tree_model, newdata = test_data)

# RMSE for the decision tree
tree_rmse <- sqrt(mean((test_data$RPL_THEMES - tree_predictions)^2))
print(paste("Decision Tree RMSE:", tree_rmse))
## [1] "Decision Tree RMSE: 0.138598526752175"
library("rattle") # For fancyRpartPlot (Trees) Answer "no" on installing from binary source
fancyRpartPlot(tree_model)
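To judge whether an RMSE of 0.139 is actually good, it helps to compare against a naive baseline that predicts the training mean for every tract. A small helper (self-contained; the commented line shows the intended use against the `train_data`/`test_data` split above):

```r
# Helper: root mean squared error
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Toy check: predicting 0 for actuals c(0, 2) gives sqrt(2)
rmse(c(0, 2), c(0, 0))

# Intended use against the split above:
# rmse(test_data$RPL_THEMES, mean(train_data$RPL_THEMES))
```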

The regression tree model explores the relationships between the Social Vulnerability Index (SVI), represented by 'RPL_THEMES,' and variables drawn from the COVID-19 and economic datasets. The decision tree, rooted in the EP_NOHSDP variable, provides valuable insight into the predictors of SVI.

The tree identifies EP_NOHSDP as the pivotal factor in the initial split, emphasizing the critical role of educational attainment in predicting SVI. Further branching on EP_POV150 and income_r delineates the socio-economic dimension of vulnerability. Notably, tracts with lower EP_NOHSDP and lower EP_POV150 percentages are predicted to have lower SVI values, indicating a correlation between educational attainment, poverty levels, and social vulnerability.

Interestingly, the COVID-19-related variables, total_cases and total_deaths, do not feature prominently in the decision tree. This suggests that, within the scope of our model, these variables are not the primary drivers of social vulnerability when considered alongside educational attainment, poverty, and income. The prominence of income_r in the tree underscores the multifaceted nature of vulnerability, where binned income categories contribute significantly to the prediction of SVI.

This decision tree provides a visually interpretable framework for understanding the complex interplay of factors influencing social vulnerability. As we present these findings, the nuanced relationships highlighted by the tree offer actionable insights for targeted interventions and policy measures, particularly focusing on education, income, and poverty alleviation in the context of social vulnerability.

Random forest model to identify variable significance across the three datasets

#significant variables for predicting total SVI (RPL_THEMES)

library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
## 
##     importance
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
# Model
rf_model <- randomForest(RPL_THEMES ~ ., data = svi_econ_covid)

# feature importance
print(rf_model$importance)
##              IncNodePurity
## GEO_ID            2.014482
## tract             2.521926
## income          110.422326
## STATE             0.000000
## county_fips       1.048596
## COUNTY            1.037469
## EP_POV150       153.409257
## EP_LIMENG        37.025773
## EP_AGE65          3.499091
## EP_MINRTY        18.348359
## EP_UNEMP          9.061591
## EP_NOHSDP       176.855275
## EP_AGE17          4.863257
## EP_DISABL        10.220366
## EP_SNGPNT        17.229356
## EP_MUNIT         11.194503
## EP_MOBILE         6.809919
## EP_CROWD         57.304590
## EP_NOVEH         18.076946
## EP_GROUPQ        14.023930
## total_cases       1.323994
## total_deaths      1.305790
## income_r         62.531463
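One caveat on the table above: identifier columns (GEO_ID, tract, county_fips, COUNTY) carry no real predictive meaning, so any nonzero purity they show is an artifact of the `RPL_THEMES ~ .` formula; a cleaner fit drops them first. The sketch below demonstrates this on a small synthetic stand-in for `svi_econ_covid` (column names borrowed from the merge above; all values are fabricated):

```r
library(randomForest)

# Synthetic stand-in: two informative predictors, one ID, one constant
set.seed(123)
toy <- data.frame(
  GEO_ID      = as.character(1:200),  # identifier: carries no signal
  EP_NOHSDP   = runif(200, 0, 30),
  EP_POV150   = runif(200, 0, 40),
  total_cases = 87753324              # constant, so zero importance
)
toy$RPL_THEMES <- 0.02 * toy$EP_NOHSDP + 0.01 * toy$EP_POV150 + rnorm(200, 0, 0.05)

# Drop identifier columns before fitting, then inspect importance
keep <- setdiff(names(toy), "GEO_ID")
rf_clean <- randomForest(RPL_THEMES ~ ., data = toy[, keep])
print(rf_clean$importance)
varImpPlot(rf_clean, n.var = 3)   # randomForest's built-in importance plot
```

The same `setdiff(names(...), id_cols)` pattern applied to `svi_econ_covid` would refit the model on substantive predictors only.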

The random forest model discerns the features most influential on the Social Vulnerability Index (SVI), ranking predictors by how much they increase node purity (IncNodePurity) across the forest. Notably, the most influential features are 'EP_NOHSDP,' the percentage of individuals without a high school diploma, and 'EP_POV150,' the percentage of individuals living below 150% of the poverty line. These socio-economic factors prominently contribute to SVI, underscoring the importance of educational attainment and economic well-being in vulnerability assessment.

Additionally, 'income_r,' the binned income variable created above, emerged as a significant predictor, highlighting the role of income levels in shaping social vulnerability. The 'EP_CROWD' feature, capturing the percentage of households with crowded living conditions, signifies the impact of housing density on vulnerability. Furthermore, 'EP_LIMENG,' the percentage of individuals with limited English proficiency, and 'EP_UNEMP,' the percentage of unemployed individuals, underscore the importance of language and employment status in vulnerability modeling.

Our analysis extends beyond traditional demographic factors, incorporating 'total_cases' and 'total_deaths' as indicators of COVID-19 impact. These variables show little importance in predicting total SVI; note that they take a single aggregate value across the rows shown above, leaving them no tract-level variation to exploit. As we present our findings, the identified features provide actionable insights for policymakers and practitioners seeking to address and mitigate social vulnerability in diverse communities.